Abstract
Exploring the opportunities for incorporating Artificial Intelligence (AI) to support team problem-solving has been the focus of intensive ongoing research. However, while the incorporation of such AI tools into human team problem-solving can improve team performance, it is still unclear what modality of AI integration will lead to a genuine human–AI partnership capable of mimicking the dynamic adaptability of humans. This work unites human designers with AI Partners as fellow team members who can both reactively and proactively collaborate in real-time toward solving a complex and evolving engineering problem. Team performance and problem-solving behaviors are examined using the HyForm collaborative research platform, which uses an online collaborative design environment that simulates a complex interdisciplinary design problem. The problem constraints are unexpectedly changed midway through problem-solving to simulate the nature of dynamically evolving engineering problems. This work shows that after the unexpected change in design constraints, or shock, is introduced, human–AI hybrid teams perform similarly to human teams, demonstrating the capability of AI Partners to adapt to unexpected events. Nonetheless, hybrid teams do struggle more with coordination and communication after the shock is introduced. Overall, this work demonstrates that these AI design partners can participate as active partners within human teams during a large, complex task, showing promise for future integration in practice.
1 Introduction
Artificial Intelligence (AI) in research, academia, and industry is becoming significantly more prevalent. A projection by McKinsey & Company [1] identified that by 2030, 70% of companies within the United States will have adopted at least one form of AI within their organizations. Furthermore, they predict that AI will contribute around $13 trillion in additional global economic activity within the same period [1]. It is clear that AI will become embedded within the future workforce. To handle this technological and organizational shift, fundamental research is needed to address key challenges related to how humans and AI can most effectively interact, as well as to study different modalities of human–AI collaboration. Focus areas and challenges include effective communication strategies, context awareness, and trust, among others [2,3]. These open questions are leading researchers to dedicate efforts to the creation of different taxonomies and platforms in order to effectively study these characteristics and dimensions within an experimental context [4–6], including in engineering design settings.
Designing effective complex engineering systems is challenging. Engineers regularly need to manage and optimize coupled design parameters with interrelated factors and evolving constraints, making such design tasks both difficult and stressful [7–9]. Furthermore, such tasks can often be time-constrained, exacerbating these complexities [10]. To facilitate solving such complex design problems, researchers are studying the implementation of Artificial Intelligence (AI) assistance. For example, research efforts have explored the implementation of AI within various phases and roles of the engineering design process, including concept space exploration [11–13] and generation [14–20], concept evaluation [21,22], optimization [23,24], prototyping [25], manufacturing [26], and process management of teams [27,28]. Further, research involving 1500 companies found that human–AI collaboration resulted in considerable performance improvements [29,30].
Even with these improvements, researchers have identified scenarios during design tasks where AI assistance hinders team performance, especially bringing down the quality of high-performing teams [31]. Consequently, to effectively integrate AI within design teams and mitigate potential adverse effects, there is a critical need for a better understanding of how AI team members impact their human counterparts. The Human-Autonomy Teams (HAT) research community has studied factors contributing to more effective and enjoyable human–AI teaming. Factors like the level of autonomy [32–34], reliability [35,36], and team composition [37–39] have been shown to affect individual and team performance, problem-solving behaviors, member interactions, mental workload, and member experiences [40]. It is essential to develop a dynamic understanding not only of how these factors evolve over the design process but also of how they vary across scenarios and problem-solving contexts.
One particularly influential factor is the level of machine autonomy within the team: simple reactive autonomy agents that pull information only upon human request are believed to hinder performance [39,40]. Previous research on hybrid teams has studied the positive effects of autonomy agents with self-initiated proactive actions, which is regarded as a trait of high levels of autonomy [32–34]. In addition, multiple studies suggest that team performance could be improved by allowing autonomous agents to anticipate and push information proactively [41–43]. To date, little research in engineering design has studied this type of dynamic AI team member or how this type of human–AI interaction impacts team performance and problem-solving behaviors. To fill this research gap, the current research examines the integration of novel AI Partners within human engineering teams. These AI Partners dynamically adapt, responding both reactively and proactively to their human teammates in real-time: they offer design improvement suggestions of their own volition as well as upon request from their human counterparts. These AI Partners also seek feedback from their teammates and update their behavior accordingly.
In prior work, the authors created the HyForm research platform to study human–AI hybrid teaming with respect to team agility [22] and AI process management [27]. The current study utilizes HyForm and updates it with proactive and adaptive autonomous agents that communicate with other teammates bi-directionally. This work differs from prior studies that utilized the HyForm platform in that it focuses on AI-as-Partner, as defined and discussed in more detail in prior work in the domain [44], rather than on AI-as-Guide as in a similar prior study [27]. This study contributes to the body of HAT research by investigating not only the impact of proactive behaviors of autonomous agents in human–AI teaming but also the influence of the adaptability of the autonomous agents and the team composition in an engineering design setting.
The paper is organized as follows. Section 2 covers the background of the study. Section 3 introduces the research methodology, including the experimental design and the experiment platform. The experiment results are shown in Sec. 4. Section 5 presents our interpretation and discussion of the results. Lastly, Sec. 6 concludes our findings.
2 Background
Adaptation contributes to the overall agility of the team. Team agility is the ability of a design team to remain robust in its capabilities, even when experiencing unexpected changes [45]. This research presents a novel setting that imposes the challenges of constantly evolving engineering tasks on a hybrid team composed of multiple highly autonomous, proactive AI agents and human members. By comparing the problem-solving behavior of these teams to that of human-only teams, this research reveals areas of improvement for human–AI hybrid teams to remain robust in their capabilities even when experiencing unexpected changes, which is essential in the collaborative design of engineering systems [9,45,46].
Previous studies have examined behaviors within hybrid teams when met with challenges set by specific task characteristics, such as increasing human-autonomous agent outcome interdependence [47–49] or task difficulty [34,50]. This research uniquely imposes the challenge of an unexpected change in the task, namely a shift in constraints and objectives, which is common in complex engineering design tasks [9,45,46]. By doing so, it introduces agility as an essential factor in constructing effective hybrid teams and provides insight into how human–AI teams can design engineering systems effectively.
The involvement of highly autonomous AI agents has been considered essential to building effective hybrid teams [33,34,40,51,52]. Features associated with autonomy include occupying a distinct role [53], a degree of interdependence with other team members' activities and outcomes [49,54], and a degree of agency involving independence of action and proactivity [55,56]. Engineering design tasks can especially benefit from AI agents with a heightened level of autonomy because the task itself requires high interdependence among members with different roles [57,58]. In addition, it is important that other members perceive AI partners as genuine team partners for the team to perform effectively [34,40,50]. This research seeks to build an effective hybrid team by first developing novel, highly autonomous AI agents and then involving them in a multi-agent, multi-human hybrid engineering team.
It has been shown that an AI agent's understanding of the individuality of fellow human team members enhances overall performance, trust, workload, and willingness to work with agents in the future [59], yet few works have integrated AI agents with this ability. This research develops AI agents that account for each human team member's preferences and go further by tracking changes in those preferences at each step to adapt their decisions accordingly. It is expected that this enhanced and dynamic awareness of the AI agent facilitates the team's ability to recover after the abrupt change in the task.
Team composition has also been considered an important independent variable in forming hybrid teams [39,50,60]. The majority of HAT studies involve single autonomous agents paired with single or multiple human partners [40]. While some involve multiple AI agents and a single human team member [61], the hybrid team presented in this research is unique in that it consists of multiple agents and multiple humans. In doing so, the research emulates truly hybrid teams, where human and AI agents, with each of their distinct roles, are highly dependent on each other and require active coordination and communication to perform engineering design tasks effectively.
3 Methodology
This section first presents the HyForm platform and gives an overview of the experimental design and conditions. It then provides detailed information regarding the underlying frameworks of the AI Partners integrated within the design teams. The agents take on different roles: one emulates a drone design specialist, while the other emulates an operations specialist. These AI Partners act and react dynamically to their human counterparts throughout the entire problem-solving process.
3.1 The HyForm Experimental Platform.
Researchers at Carnegie Mellon University (CMU) and Penn State University (PSU) jointly developed an open-source, experimental research platform called HyForm [9,46,62]. HyForm uses an online collaborative design environment that simulates a complex interdisciplinary design problem. The platform organizes AI agents and human team members into teams with specific design tasks: creating and operating a fleet of drones that makes deliveries to customers. Different types of deliveries result in different revenues, and the goal of the team is to maximize overall profit. The HyForm platform captures some of the unique and essential features of the engineering design process, including heterogeneous teams spanning multiple disciplines, a large solution space, responsiveness to customer needs and demands, and coordination among disciplines facing complex design constraints [9].
The HyForm platform allows researchers to track each distinct action and communication among team members [7,22,27]. Text-based communication channels allow team members within and across different team roles to exchange information. The team roles include Problem Manager, Design Specialist, and Operations Specialist; each team role operates using a distinct module within the HyForm platform, allowing them to work independently on their sub-tasks. The Problem Manager uses the business plan module to pick customers, determine the Operations Specialists' market, and select the most profitable plan developed by the team. Additionally, for the particular team structure in this work, the Problem Manager directs the team and facilitates communication between the Design Specialists and Operations Specialists. The Design Specialists use the drone design module to construct and evaluate drones with respect to their cost, range, velocity, and payload capabilities. Once a drone design is completed, it can be used by the Operations Specialists. The Operations Specialists use the operations module to develop and evaluate delivery routes for the customers selected by the Problem Manager, using the drones created by the Design Specialists. The overall objective of the team is to maximize profit, which is exposed directly to the Problem Manager and the Operations Specialists. The Design Specialists, however, deal with a multi-objective problem involving payload, cost, range, and speed; their individual-level objective is to design drones optimized for both performance and cost.
HyForm allows researchers to reconfigure the communication channels between roles, enabling experimenters to restrict or extend team communication to study different team structures. The HyForm platform introduces transformations to the problem context (changing the customer maps, package types, timeline, etc.) enabling the ability to handle changes in the problem. Additionally, HyForm integrates AI design agents enabling the collaboration of human and AI agents in teams. The latter is the primary focus of this research, studying novel AI agents as active team partners.
3.2 Experimental Design.
The experiment is approved by the Institutional Review Board at CMU. Before the experiment, all participants read and sign a consent form. After the completion of the experiment, participants are compensated with a $20 Amazon gift card. For this experiment, 105 engineering undergraduate and graduate students are recruited from both CMU and PSU. The experiment is run virtually through the HyForm platform.
In this experiment, two conditions are used at the team composition level: (A) a human team and (B) a hybrid team composed of both human and AI Partners, as shown in Fig. 1. The principal difference between the two conditions is the composition of the members. In the human team condition (Fig. 1(a)), all five team members are human participants. Conversely, in the hybrid team condition (Fig. 1(b)), the team consists of three human participants and two AI Partners. The team communication structure remains the same for both conditions. In Fig. 1, the solid arrows indicate open lines of communication between team members, including the AI Partners in the hybrid team condition. In their communication, participants are aware of whether the counterpart is a human or an AI team member, emulating an authentic teamwork environment in which AI identity is unlikely to be blinded. Pilot testing also indicated that participants were able to identify when they were working with AI or human teammates. The Problem Manager handles the exchange of information between the disciplines: the two Design Specialists and the two Operations Specialists. The Problem Manager is in charge of coordinating between these distinct roles, trading off competing operations plans, and ultimately selecting a final solution for the team. For this experiment, participants are randomly assigned to a role in one of the two team conditions. Random assignment avoids selection bias and strengthens the internal validity of the experimental design. Participants did not have previous experience with this design task, so it was not necessary to explicitly match their expertise. The complete data collection consists of 10 human teams and 14 hybrid teams.
3.3 Design Problem.
Collectively, teams are tasked to build, manage, and optimize a delivery drone fleet to maximize total profit, with each team member's role carrying specific responsibilities. Team performance is measured by the highest profit generated within a business plan submitted by a team's Problem Manager. This work introduces a drastic, unexpected change, or "shock," to the design problem constraints for the second half of the experiment. After this shock is introduced, additional design constraints are imposed in the design and operations modules, as illustrated in Fig. 2. First, the Design Specialists must consider the geometrical constraints of the hangar space: two walls are introduced into the design module (left), restricting the dimensions of feasible drones. Second, the Operations Specialists must consider a no-flight zone on the customer map: a large cylinder is placed on the customer map (right), obstructing the flight of drones through this area.
It is important to note that the AI Partners in this work are not designed to handle the unexpected shock. The AI Design Partner has no knowledge of the new sizing restrictions, since its knowledge base is built around optimizing drone performance without size constraints. Human participants do have the ability to guide the AI Design Partner by referencing existing designs and entering commands based on performance metrics (more on this later), and the resulting AI Design Partner designs can be modified by human participants to satisfy the geometric constraints. Likewise, the AI Operations Partner has no knowledge of the unexpected change. It uses an underlying Linear Programming algorithm to calculate delivery plans, which does not account for no-fly zones between customers. As with the AI Design Partner's designs, the AI Operations Partner's plans can be modified by human participants.
3.4 Design Study Timeline.
The experiment has six stages, chronologically illustrated in Table 1. The experiment starts with a pre-study session, where participants have 15 min to read and sign a consent form.
Table 1: Experiment outline (identical for the Human and Hybrid conditions)

Stage | Activity
---|---
Pre-study (15 min) | Consent form and pre-study questionnaire
Training (10 min) | Guided tutorial for each team role
Session 1 (20 min) | Market 1
Break (5 min) | Mid-study questionnaire
Session 2 (20 min) | Market 2 (shock introduced)
Post-session (10 min) | Post-study questionnaire
Next, participants read the problem statement and complete a pre-study questionnaire. The problem statement details the design problem, the participant's assigned role, and the team structure. Each role (Design Specialist, Operations Specialist, and Problem Manager) has specific responsibilities and a unique problem statement. The pre-study questionnaire collects information on participants' backgrounds with respect to pertinent aspects of the experiment, such as building drones, business planning, and computer-aided design proficiency.
During the second stage, participants complete a 10-min training tutorial introducing them to their respective roles and the relevant tools in the platform by guiding them through a role-specific task. For example, Design Specialists are guided through building their first drone and evaluating its performance. The training session was designed through iterative rounds of testing and aligns closely with that used in prior HyForm studies. The third stage of the experiment is the first problem-solving session, in which teams have 20 min to design drone fleets and delivery plans that maximize their profit for the first market. After the 20 min, participants complete a mid-study questionnaire, followed by a 3-min break.
Next, participants start the second problem-solving session. As in the first session, teams have 20 min to design drone fleets and delivery plans that maximize their overall profit; however, the two aforementioned shocks to the design problem are introduced in this stage. Finally, after completing the second problem session, participants complete a post-study questionnaire, concluding the experiment. The mid- and post-study questionnaires include the NASA raw task load index (NASA-RTLX) survey [51,52,63,64] to evaluate participants' mental workload and cognitive experience during the task. A group of additional questions assesses teams' performance and problem-solving behaviors, using a Likert-type scale to rate team characteristics including overall team effort, goals, quality of work, collaboration, and communication [65–67].
3.5 Artificial Intelligence Partner Algorithm.
In the hybrid teams, two proactive and reactive AI Partner agents are created and integrated into the HyForm platform, replacing one human Design Specialist and one human Operations Specialist. Communication with these AI Partners is restricted to the same chat channels that human team members use, in which team members enter requests to their AI Partners in a valid syntax. The syntax is designed to achieve technical accuracy given the limited natural language processing capability of the AI Partners, and the communication grammar corresponds with the human-human communication observed in prior HyForm studies [7,22,27]. The syntax permits team members to communicate drone/plan metrics (e.g., speed, payload, distance), share a preference direction (e.g., minimize or maximize), reference existing designs/plans (baselining), and provide feedback on responses (satisfaction with suggested designs). More precisely, the text grammar supports the following types of requests (a parsing sketch follows the list):
Want more or less of a certain metric;
Want a specific value of a certain metric;
Reference an existing design or plan as a baseline to change metrics;
Query the AI Partner for a new design or plan based on their current design state.
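To make this grammar concrete, the following minimal sketch shows one way such messages could be decomposed into a baseline reference, a metric, a preference direction, and an optional target value. The paper does not publish HyForm's actual parser, so all names here are illustrative assumptions.

```python
import re

# Illustrative sketch of a parser for the request grammar described above;
# the actual HyForm implementation is not shown in this paper.
METRICS = {"range", "capacity", "cost", "speed", "payload", "profit", "customers"}
DIRECTIONS = {"more": +1, "higher": +1, "less": -1, "lower": -1}

def parse_request(message: str):
    """Extract (baseline, metric, direction, value) from a valid chat message."""
    tokens = message.lower().split()
    baseline = tokens[0][1:] if tokens[0].startswith("@") else None  # e.g., "@design1"
    metric = next((t for t in tokens if t in METRICS), None)
    direction = next((DIRECTIONS[t] for t in tokens if t in DIRECTIONS), 0)
    match = re.search(r'(?:of|than)\s+"?(\d+(?:\.\d+)?)"?', message.lower())
    value = float(match.group(1)) if match else None
    return baseline, metric, direction, value

# parse_request('@design1 want more range than "30"') -> ('design1', 'range', 1, 30.0)
```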
As shown in Fig. 3, team members can communicate with their AI Partners in two ways: either through direct chat channels (Fig. 3, top) or through a wizard (Fig. 3, bottom) that assists in assembling commands to their AI Partners in the proper syntax. The wizard allows team members to generate string-based commands that automatically satisfy the text grammar for communication with the AI Partners. Table 2 provides more detailed examples of valid chat messages for both the AI Design Partner and the AI Operations Partner.
Fig. 3: Communication with AI Partners. The AI Partners monitor the chat channels, where team members can either type (top) or use a wizard (bottom) to enter valid messages.
Table 2: List of valid text grammar commands to the AI Partners

AI Design Partner messages | AI Operations Partner messages
---|---
want more range | want higher profit
want higher capacity | want lower cost
want lower cost | want more customers
want range of "X" | want profit of "X"
want more range than "X" | want more profit than "X"
@design1 want more range | @plan1 want more profit
@design1 want more range than "X" | @plan1 want more profit than "X"
no | @plan1 want more profit than "X" northwest
 | no
Figure 4 illustrates a step-by-step example query with the wizard, where the "example results" show the output of the grammar in the proper syntax sent to the AI Partner. The wizard allows team members to request new designs with preference parameters ("want"), make updates to existing designs ("plan"), query the status of a design request ("ping"), and ask for general help ("help"). Once the wizard is completed, the query is sent to the AI Partner through the chat with the proper grammar. The human partners can also respond with "no" or "unsatisfied" when their AI Partners ask for feedback.
Fig. 4: An example query via the wizard with the AI Design Specialist Partner, showing the resulting output in the proper grammar/syntax ("@r4_c0_$1590" refers to a specific drone design named by the team)
3.5.1 Artificial Intelligence Partner Algorithms.
The AI Partners use designs already created by team members, in conjunction with information from chat messages, to identify a target and a preference direction for each team member and suggest design solutions accordingly. The state of the AI agents is updated based on user events, and the target-and-preference approach is based on previous visual steering methods in trade space exploration [68]. More precisely, the AI Partners use (1) targets, which represent a team member's state based on their current design in the performance space (either a drone design for a Design Specialist or a path plan for an Operations Specialist), and (2) preference directions, which are formulated from each team member's chat messages by extracting pertinent keywords (e.g., more, less, or a reference to an existing design). The AI Partners store individual targets and preference states for each team member and, when queried by a human team member, perform a neighborhood search around the target and return a suggested solution following that member's preference direction. In this way, the AI Partners continually update team member states and adapt their suggested design solutions over time. Figure 5 qualitatively illustrates this process.
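A simplified sketch of this per-member bookkeeping is shown below. It assumes a `levenshtein` helper and a `satisfies_tolerances` check (both sketched later in this section), and represents candidate designs as dictionaries with a string `config` and a `metrics` map; all names are illustrative, not HyForm's published code.

```python
from dataclasses import dataclass, field

@dataclass
class MemberState:
    """Per-team-member state tracked by an AI Partner (illustrative sketch)."""
    target_config: str = ""                              # string form of the current design/plan
    target_metrics: dict = field(default_factory=dict)   # e.g., {"range": 20.0, "cost": 3000.0}
    preferences: dict = field(default_factory=dict)      # metric -> +1 (more), -1 (less), 0 (same)

class AIPartner:
    def __init__(self, candidate_pool):
        self.candidate_pool = candidate_pool  # e.g., pre-sampled drone designs or LP plans
        self.states = {}                      # member id -> MemberState

    def update_from_chat(self, member, metric, direction, baseline=None):
        """Fold a parsed chat message into the member's target and preference state."""
        state = self.states.setdefault(member, MemberState())
        if baseline is not None:              # baseline: referenced design, looked up by name
            state.target_config = baseline["config"]
            state.target_metrics = dict(baseline["metrics"])
        state.preferences[metric] = direction

    def suggest(self, member, candidates):
        """Neighborhood search: closest feasible candidate to the member's target."""
        state = self.states.setdefault(member, MemberState())
        feasible = [c for c in candidates
                    if satisfies_tolerances(c["metrics"], state.target_metrics,
                                            state.preferences)]
        return min(feasible,
                   key=lambda c: levenshtein(c["config"], state.target_config),
                   default=None)              # None -> reply "unsatisfied" in the chat
```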
To support this adaptive design approach, a limit is placed on the amount of change from the target to a newly suggested design/plan in the configuration space, based on a Levenshtein distance metric. This metric is used because both drones and path plans are stored as strings rather than numeric vectors [69]. Figure 6 provides an example of two drone designs with a Levenshtein distance of 4, where the second drone configuration has four additional '+' characters, representing scaled-up motors. This demonstrates that even though the distance metric operates on an abstract string representation, the distance is still semantically meaningful. Prior work [9,22] describes HyForm's string representations for designs in more detail.
Fig. 6: Example of the Levenshtein distance between two drone configurations. Each string represents the corresponding drone design.
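The metric itself is the standard edit distance; a textbook dynamic-programming implementation is sketched below, with an abstract example mirroring the Fig. 6 case (the actual HyForm design strings are not reproduced here).

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion from a
                            curr[j - 1] + 1,             # insertion into a
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Abstract analogue of Fig. 6: inserting four '+' characters (scaled-up motors)
# into a configuration string yields a distance of 4.
assert levenshtein("abcd", "a+b+c+d+") == 4
```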
When a query is sent to an AI Partner, the AI Partner uses the information from the query in combination with the previously described preference directions to support its human teammates. Figure 7 shows the underlying logic that the AI Partners use for this process. All design solutions (drone designs or delivery plans) that the AI Partners return must satisfy the requested requirements. If an AI Partner cannot find a feasible design, it returns an "unsatisfied" chat message to its human teammate. The AI Partners also undergo an initial startup phase: if they do not have sufficient preference information for a design metric in the analysis (e.g., range, cost), they ask for it. Team members can then respond, using the proper text grammar rules, with their preferences and requirements, or state that they have no preference. After the startup phase, each team member chat message that satisfies the text grammar prompts the AI Partners to return either a satisfying design (drone design or delivery plan) or an "unsatisfied" response based on the AI Partner's analysis. This process repeats with each human-initiated query to the AI agent, demonstrating the reactive nature of the AI Partners.
Additionally, at the 9- and 18-min marks of each problem-solving session (recall each session is 20 min long), the AI Partners proactively provide a design solution to each team member based on that member's saved preference state at that moment. Every time an AI Partner suggests a design, it saves the design solution to its team's underlying database. The design is displayed to other team members, and a notification chat message is sent to the team announcing that a new design has been submitted. This demonstrates the proactive nature of the AI Partners. Note that the AI Partner algorithms are fixed before the experiment and are not tuned or altered during it.
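This proactive behavior can be pictured as a pair of timers per session. The sketch below reuses the `AIPartner` class sketched above and assumes hypothetical `team.database` and `team.chat` objects; HyForm's actual event loop is not published in this paper.

```python
import threading

PROACTIVE_MARKS_MIN = (9, 18)  # fixed marks within each 20-min session

def schedule_proactive_suggestions(partner, team):
    """At each mark, push one suggestion per member from their saved preference state."""
    for mark in PROACTIVE_MARKS_MIN:
        def push(mark=mark):
            for member in team.members:
                suggestion = partner.suggest(member, partner.candidate_pool)
                if suggestion is not None:
                    team.database.save(suggestion)  # becomes visible to all team members
                    team.chat.broadcast(f"AI Partner submitted a new design ({mark} min)")
        threading.Timer(mark * 60, push).start()
```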
3.5.2 Artificial Intelligence Design Partner.
The AI Design Partner, designed to act as a Design Specialist within the team, identifies target locations and preference directions within a drone metric trade space [68] defined by the velocity, range, capacity, and cost of solutions. To enable rapid responses, a prepopulated trade space of drone configurations is sampled and used as a basis for the AI Design Partner to select and return designs. The pre-sampled database includes 1043 unique drone designs, with range, capacity, cost, and velocity metrics, generated using a character recurrent neural network (char-RNN) trained on the drone design space [9,70]. The AI Design Partner algorithm samples a subset of this database (M = 400) and calculates the Levenshtein distance [69], based on the string representation of the design configuration, between each sampled design and the current design. The AI Design Partner returns the design that minimizes the Levenshtein distance while satisfying the tolerances for each metric based on the current team member's preference (Table 3).
Table 3: AI Partner tolerance conditions for each metric

Preference | AI Design Partner (range/capacity/cost) | AI Operations Partner (profit/cost/customers)
---|---|---
More/Higher | min < (AI Partner metric - current metric) < max | min < (AI Partner metric - current metric) < max
Less/Lower | min < (current metric - AI Partner metric) < max | min < (current metric - AI Partner metric) < max
Same | abs(current metric - AI Partner metric) < min | abs(current metric - AI Partner metric) < min
The lower bound tolerances for "More/Higher" and "Less/Lower" ensure that the resulting design differs in metric performance from the current state (i.e., that the AI Partner sufficiently addresses the request). Similarly, the upper bound keeps the suggestion within a neighborhood region (i.e., the AI Partner does not depart substantially from user preferences). As previously mentioned, if the AI Design Partner is unable to find a feasible design, an "unsatisfied" message is returned to the team through the chat.
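The conditions in Table 3 translate directly into a filter. The sketch below uses illustrative min/max values, since the paper does not report the exact tolerances used.

```python
MIN_TOL, MAX_TOL = 5.0, 50.0  # illustrative bounds; actual values are not reported here

def satisfies_tolerances(candidate_metrics, current_metrics, preferences):
    """Apply the Table 3 conditions for every metric the member stated a preference on."""
    for metric, direction in preferences.items():
        delta = candidate_metrics.get(metric, 0.0) - current_metrics.get(metric, 0.0)
        if direction > 0 and not (MIN_TOL < delta < MAX_TOL):    # more/higher
            return False
        if direction < 0 and not (MIN_TOL < -delta < MAX_TOL):   # less/lower
            return False
        if direction == 0 and not (abs(delta) < MIN_TOL):        # same
            return False
    return True
```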
3.5.3 Artificial Intelligence Operations Partner.
The AI Operations Partner, designed to act as an Operations Specialist within the team, identifies target locations and preference directions within a plan metric trade space defined by the profit, cost, and number of customers served by each solution. To return a suggested plan based on a team member's request, the AI Operations Partner runs several rapid Linear Programming delivery path analyses, each with a different random combination of the available drone designs already created by the Design Specialists. The resulting plans are saved in string format and evaluated by calculating the Levenshtein distance to the requesting team member's current plan. The AI Operations Partner returns the plan that minimizes the Levenshtein distance while satisfying the tolerances for each metric based on the current team member's preference, as shown in Table 3. If the AI Operations Partner is unable to find a feasible plan, an "unsatisfied" message is returned to the human.
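Putting the pieces together for the Operations Partner, a hedged sketch follows. Here `solve_delivery_lp` is a hypothetical stand-in for HyForm's Linear Programming routing call, and the number of trials is an assumption; both reuse the helpers sketched earlier.

```python
import random

def suggest_plan(partner, member, available_drones, n_trials=20):
    """Run several LP path analyses over random drone subsets, then pick the
    feasible plan closest (by Levenshtein distance) to the member's current plan."""
    state = partner.states.setdefault(member, MemberState())
    plans = []
    for _ in range(n_trials):
        fleet = random.sample(available_drones,
                              k=random.randint(1, len(available_drones)))
        plan = solve_delivery_lp(fleet)  # hypothetical LP routing call
        if plan is not None:
            plans.append(plan)           # each plan: {"config": str, "metrics": dict}
    feasible = [p for p in plans
                if satisfies_tolerances(p["metrics"], state.target_metrics,
                                        state.preferences)]
    return min(feasible,
               key=lambda p: levenshtein(p["config"], state.target_config),
               default=None)             # None -> "unsatisfied" message
```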
4 Results
The following section compares the hybrid and human teams in terms of overall performance (i.e., team profit), team behaviors, and team experiences. The first two are measured via data collected through HyForm during the experiment sessions, while the last is gathered via responses to the mid- and post-study questionnaires. R version 4.0.1 was used for these analyses, and unless stated otherwise, all assumptions of the statistical tests were met.
4.1 Team Performance.
Team performance is measured using the overall profit achieved by each team during the problem sessions. While teams can submit multiple plans that result in multiple profits, only the plan with the highest profit is considered for the analysis. The maximum profit is tracked and averaged across all teams within a condition. Figure 8 shows the average maximum profit for each type of team by experimental session. To determine whether the AI Partners influenced team performance, a two-sample Wilcoxon rank-sum test is used. Overall, the two team conditions achieve statistically indistinguishable levels of performance in both Session 1 (p-value = 0.333, W = 53) and Session 2 (p-value = 0.883, W = 73). This shows that the AI Partners are effective replacements for their human team counterparts in terms of average performance. An interesting note, however, is that the hybrid teams exhibit greater variance, particularly on the higher end, as seen in Fig. 8. This may indicate that hybrid teams have greater potential for high performance, but further investigation is needed to draw a decisive conclusion, as the data from this study are insufficient.
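For reference, R's two-sample wilcox.test is equivalent to the Mann-Whitney U test; a Python sketch of the same comparison, using placeholder profit values rather than the study data, would be:

```python
from scipy.stats import mannwhitneyu

# Placeholder profits, not the study data; R's wilcox.test W statistic corresponds
# to scipy's Mann-Whitney U for the first sample.
human_profits  = [4200, 3800, 5100, 2900, 4600]
hybrid_profits = [3900, 5600, 2700, 4100, 6800]

stat, p = mannwhitneyu(human_profits, hybrid_profits, alternative="two-sided")
print(f"W = {stat}, p = {p:.3f}")
```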
4.2 Team Problem-Solving Behavior.
Teams' problem-solving behaviors are analyzed using two metrics: communication count and action count. As mentioned in the previous section, team members using HyForm can only communicate through a text chat tool and complete specific design actions relevant to their roles. HyForm tracks and collects these metrics over time, allowing complete reconstruction of the team's problem-solving process. Previous work revealed tradeoffs between time allocated toward communication and time spent taking action when designers experience an unexpected problem change [22]. In this work, a shock is introduced halfway through the experiment, after which the drone design and operations disciplines receive different constraints based on their sub-tasks (hangar dimension restrictions and a no-flight zone, respectively). The communication count encompasses all textual messages sent from one team member to another, and the action count encompasses all distinct actions taken by a member within their design role. To determine whether the AI Partners influenced team communication, a two-sample Wilcoxon rank-sum test is used.
When comparing the overall results combining both experiment sessions, the human teams communicate significantly more than the hybrid teams (p-value = 0.001, W = 431). After the problem shock is introduced between problem-solving sessions, human teams react by markedly increasing their communication, as presented in Fig. 9. When observing the experiment sessions separately, human teams show a significantly higher average communication count during Session 1 (p-value = 0.005, W = 117.5), while during Session 2, this difference is no longer significant (p-value = 0.095, W = 99).
Fig. 9: Average team communication count: (a) overall team communication count, (b) Design Specialist communication count, (c) Operations Specialist communication count, and (d) Problem Manager communication count
Figures 9(b)–9(d) dive deeper into communication behaviors by comparing counts by team role during each experiment session. During the first problem-solving session, the Design Specialists (p-value = 0.025, W = 108.5) and Problem Manager (p-value = 0.03, W = 106.5) communicate significantly more in the human teams than in the hybrid teams. This difference is not significant for the Operations Specialists (p-value = 0.35, W = 86.5). Similarly, after the shock is introduced (Session 2), the trend remains for both the Design Specialists (p-value = 0.05, W = 103.5) and Problem Manager (p-value = 0.010, W = 114), while there is no significant difference for the Operations Specialists (p-value = 0.93, W = 72). The Problem Managers in the human team condition react the most to the problem shock, exhibiting the steepest increase in average communication count between sessions, as seen in Fig. 9(d), in comparison to Figs. 9(a)–9(c).
The second behavioral metric, action count, reveals a different trend between the two team structures. To determine whether the AI Partners influenced the actions taken by human designers, a two-sample Wilcoxon rank-sum test is used. Here, action is assessed by the average number of design changes, regardless of which specific action is taken. Because the AI Partners in the hybrid teams cannot act in the same manner as their human counterparts, actions are only counted for human partners, and the values are normalized by the number of human members in the team. Figure 10(a) shows the average action count per team condition by session, indicating no significant difference in the overall action count between team structures (Session 1: p-value = 0.792, W = 75; Session 2: p-value = 0.187, W = 93), with the human and hybrid teams acting similarly. Additionally, Figs. 10(b)–10(d) break this behavior down by team role. There is no significant difference in either session for Design Specialists (Session 1: p-value = 0.75, W = 76; Session 2: p-value = 0.43, W = 84), Operations Specialists (Session 1: p-value = 0.62, W = 61; Session 2: p-value = 0.57, W = 60), or Problem Managers (Session 1: p-value = 0.81, W = 74.5; Session 2: p-value = 0.59, W = 60.5). Generally, the human teams tend to communicate more than, and act similarly to, the hybrid teams.
Fig. 10: Average team action count: (a) overall average action count per team member, (b) Design Specialist action count, (c) Operations Specialist action count, and (d) Problem Manager action count
4.3 Hybrid Team Process Behaviors.
This section specifically studies the behaviors of the hybrid teams. Recall that hybrid teams consist of three human team members and two AI Partners: the AI Design Partner and the AI Operations Partner. Figure 11 shows the proportion of total communication count between human–human and human–AI exchanges. In the first problem-solving session, communication is split roughly evenly between human–human discourse (46.6%) and human–AI discourse (53.3%). After the shock, however, this behavior changes radically: communication shifts heavily toward human–AI exchanges, which rise to 80.5% of the overall communication.
As mentioned previously in Sec. 3, both AI Partners monitor the chat channels to collaborate with their human teammates. This communication is restricted by a grammar structure, where team members enter requests in a validated format. Figure 12 shows the types of requests from human designers to their AI Partners for each team role in both experiment sessions. During the first session, all team roles communicate with their AI Partners at a similar frequency. After the shock, however, there is a drastic increase in communication with the AI Partners for both the Operations Specialist and Problem Manager roles. Specifically, there is a medium effect size (Cohen's d = 0.66) in communication frequency between sessions, indicating a practically significant increase. Operations Specialists split their requests between new designs ("Want [New Design]", 46.20%) and design updates ("Want [Design Update]", 34.38%), and Problem Managers focus primarily on requesting design updates ("Want [Design Update]", 76.47%). In contrast, Design Specialists do not request new or updated design solutions; instead, they increase their inquiries on the design status ("Ping [Design Status]") of their AI Partner.
4.4 Team Members' Experience.
This section analyzes the questionnaires taken by team members following each experiment session. The first part of the questionnaire uses a version of the NASA-RTLX [63,64], modified per Nolte and McComb [71] to better align with cognitive experience, to acquire insights into participants' cognitive demand and workload perceptions while working through the design problem. To determine whether the AI Partners influenced participants' cognitive demand and workload perceptions, a two-way ANOVA is used. The second part of the questionnaire asks participants about team performance and team cohesion features, including overall team productivity, effort, and whether the team came to a consensus effectively. This includes modified versions of the short Trust Perception Scale-HRI [66], the Team Effectiveness Instrument [65], and an effectiveness survey developed by the researchers of this study using the results of a previous study [67]. To determine whether the AI Partners influenced participants' team perceptions, a Wilcoxon test is used.
4.4.1 Cognitive Experience and Mental Workload.
The first part of the questionnaire appraises participants' mental workload and cognitive experience. Mental workload scores are calculated for each participant by summing their ratings on the following individually rated sub-dimensions: mental demand, temporal demand, performance, effort, and frustration. The sum is then averaged by dividing by the number of sub-dimensions (i.e., five). Similarly, cognitive experience scores are calculated for each participant by summing their ratings on all the modified NASA-RTLX sub-dimensions (mental demand, temporal demand, performance, effort, stress, discouragement, insecurity, and frustration) and dividing by the number of sub-dimensions (i.e., eight). For both global dimensions, the performance sub-dimension is reverse-scored before summation to match the polarity of the others.
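As a concrete sketch of this scoring (the ratings and the 0-100 scale below are illustrative assumptions, not participant data):

```python
# Illustrative ratings on an assumed 0-100 scale; the performance sub-dimension
# is reverse-scored before averaging, as described above.
SCALE_MAX = 100

workload_ratings = {"mental demand": 70, "temporal demand": 55,
                    "performance": 80, "effort": 65, "frustration": 40}

def rtlx_score(ratings, reverse=("performance",), scale_max=SCALE_MAX):
    adjusted = [scale_max - v if k in reverse else v for k, v in ratings.items()]
    return sum(adjusted) / len(adjusted)

print(rtlx_score(workload_ratings))  # (70 + 55 + 20 + 65 + 40) / 5 = 50.0
```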
Figure 13 shows the overall cognitive experience (Fig. 13(a)) and mental workload (Fig. 13(b)) measurements by team role for each team condition. Neither overall cognitive experience nor mental workload shows a significant difference when comparing design roles and team structures, even after the shock introduced in Session 2. However, within the underlying sub-dimensions of these NASA-RTLX measures, the Problem Manager role in both team conditions faces a significant increase in mental demand (p = 0.032, W = 347.5) when transitioning from Session 1 to Session 2. Additionally, there is a significant difference in frustration between the human and hybrid teams (p = 0.032, W = 2833). The Operations Specialists show the lowest variability in the NASA-RTLX measurements across team roles for both the human and hybrid teams.
4.4.2 Team Effectiveness and Cohesion.
The second part of the questionnaire measures and evaluates participants' perceptions of their team's effectiveness and interactions critical to the team's success. Team members' answers are recorded using a Likert-type scale representing seven discrete options, from “strongly disagree/inaccurate” to “strongly agree/accurate.”
Figures 14 and 15 present the results from these questionnaires. During Session 2, members of the hybrid teams perceive their teams as marginally less efficient (p = 0.088, W = 779.5) and report that team roles are less clear (p = 0.0202, W = 706.5) compared to human teams. They also identify significantly less effective feedback (p = 0.028, W = 721.5), significantly less equal participation (p = 0.0099, W = 676), and significantly less collaboration (p = 0.0076, W = 666.5) from their teammates. Furthermore, members of hybrid teams perceive that their team communicated significantly less effectively during both experiment sessions (Session 1: p = 0.046, W = 744.5; Session 2: p = 0.013, W = 687). These questionnaire results highlight team members' negative perception of the effectiveness and cohesion within the hybrid teams, even though the hybrid teams performed equivalently well.
Team effectiveness perception: (a) team is efficient, (b) team communicates effectively, (c) team members are cooperative, and (d) members are clear about their roles
5 Discussion
This work studies the performance and problem-solving behaviors of hybrid-partnered engineering design teams compared with human-only design teams, with a drastic, unexpected shock introduced midway through the design task. The human teams are composed of five human members with distinct roles who can communicate and share solutions. Similarly, the hybrid teams are composed of three human members and two AI Partners who can communicate and share solutions between team members. It is believed that human–AI collaboration could yield considerable improvements in overall performance compared to human-only teams [29,30]. However, the engineering design community is still researching the factors influencing human–AI teaming effectiveness and enjoyment. This study adds to the body of research by introducing a unique AI Partner that can dynamically interact, both reactively and proactively, in real-time with its human counterparts to emulate a genuine team partner. The anticipatory, proactive information-pushing behavior of AI Partners in team interactions is expected to improve team performance [41–43]. The impacts of these AI Partners on both performance and problem-solving behaviors are explored in the context of engineering teams solving complex, interdisciplinary design problems.
Both human and hybrid teams perform similarly. The overall performance of hybrid teams indicates that the AI members' skill level may be comparable to that of the human members [52–54]. This supports the design of the AI Partners created for this research by showing that they can adeptly stand in as substitutes for human partners rather than simply acting as assistants [22,33,52]. While there is no significant difference between human-only and hybrid teams in overall performance, the hybrid teams do exhibit greater variability in team performance, with a long tail of high-profit teams. These results indicate that hybrid teams demonstrate high potential for better performance and that the high capability of the AI Partners can be a contributing factor. Other works have shown that possible mechanisms behind team performance include information-processing functions [40] and compatibility with human partners [66].
It is also significant that the hybrid teams maintain comparable performance across sessions, even though the AI Partners are blind to the abrupt changes in constraints. It is likely that the increase in communication between human participants and AI Partners supported the consistent level of performance in hybrid teams. This result demonstrates that the adaptivity of human–AI hybrid teams can come from the human side of the partnership, rather than requiring perfectly adaptive AI Partners. It also raises the possibility of hybrid teams substantially outperforming human-only teams under certain team dynamics, such as effective communication [56]. This insight is potentially of great importance as a principle for forming teams in dynamic scenarios, in which the human portion of the team need only be as large as necessary to handle adaptation.
In addition to comparing performance levels, a critical aspect of this work is exploring how the integration of these AI Partners fundamentally impacts the teams' problem-solving behaviors. The underlying difference between human and hybrid teams lies in their overall communication, while there is almost no difference in their overall action behavior. Previous literature on the design thinking process indicates that team communication increases at the beginning of the collective design process, as the team clarifies goals and establishes a common understanding of the design requirements [72]. A similar pattern is observed in this study when the shock is introduced: the altered design requirements effectively propel the team to communicate toward goal clarification. Examining the communication behavior at the role level reveals that Problem Managers in human teams exhibit the steepest increase in communication, particularly after the problem shock is introduced prior to the second problem session. As the Problem Managers are directly connected to all other individuals within the team, they serve as the central node of the team structure [73,74]. The upsurge in communication could be caused by the centrality these managers have in both the communication structure (the need to communicate information across disciplines) and the task structure (the need to oversee and submit final plans), as they mediate and transform the flow of information for other team members [75]. These critical efforts at maintaining information flow may have made the human teams more effective in responding to the change in the problem. As shown in the results, human team members respond to the new constraints by increasing their communication efforts, especially at the management level. This result raises the question of whether limitations in communication between humans and AI hinder the adaptation of hybrid teams to the abrupt shock. For instance, for the human participants in this study, knowing that a fellow teammate is an AI could have deterred their willingness to communicate. Work in human-computer and human-robot interaction suggests that human–AI communication suffers from reduced interpersonal attraction, which can affect user perception [76,77]; such limitations are typically not present in human-human collaboration. This is an important consideration that must be accounted for in the long-term performance of these teams.
Analysis of communication frequency and distribution reveals problem-solving behaviors related to interdependence within the hybrid teams. During the first session, communication between human members and between humans and AI members is roughly equally distributed. Following the shock, however, communication shifts much more heavily toward human–AI exchanges, particularly for the Operations Specialist and Problem Manager roles. In contrast, Design Specialists show only a slight increase in human–AI communication. We can infer that human team members rely more on their AI Partners under unexpected shocks, especially for operations-related tasks rather than for design tasks, which aligns with results from similar previous studies [78]. The results of this work point to the significance of communication for the success of human–AI teams. Accordingly, it would be valuable for future research to analyze the quality of communication [79]. For instance, patterns in the chat content could be analyzed using natural language processing methods to examine what types of information are transferred between members of a hybrid team. Such analysis could further determine whether the relationship between communication and actions in hybrid teams restricts team performance.
Even though the AI Partners prove effective at replacing human counterparts, questionnaire results show that human team members within the hybrid teams have a negative perception of their team's effectiveness and cohesion. These results align with prior HAT research findings that human teammates are usually perceived as more effective and as facilitating more communication than autonomous partners [49]. This finding is especially pronounced for cooperative features, including effective communication and feedback, collaboration, and equal participation, all of which hybrid team members rate as significantly inferior. These findings have strong implications for the hybrid teams of the future. For human and AI Partners to work together effectively, a more positive team experience is needed, as these factors correlate highly within problem-solving teams [57,58,80]. Although the two team conditions perform similarly, whether these negative perceptions limited the teams' potential performance cannot be answered here and remains an area for further experimentation.
6 Conclusion
This work analyzes and compares the team performance, problem-solving behaviors, and team experiences of human and human–AI hybrid teams during a complex, interdisciplinary design task. The AI Partners in this research emulate genuine team partners, able to react to and proactively collaborate with their human teammates dynamically. A drastic change (i.e., shock) to the problem constraints is also introduced midway through the experiment to simulate an evolving engineering problem in practice. Results show that the hybrid teams perform similarly to the human teams, indicating that the AI Partners can effectively replace and adeptly support their human design partners in terms of performance. However, a similar conclusion cannot be drawn for other aspects of the teamwork. Hybrid teams struggle to adapt their coordination and communication following the shock, and the human members of hybrid teams perceive their teams as inferior across several interpersonal dimensions, including the effectiveness of team communication, feedback, and the equality of contribution among team members. These results imply that flexible AI is not always necessary, as humans can address and compensate for the deficiencies of an AI teammate. However, inflexible AI increases human stress, which could be a hidden vulnerability for team performance and reliability over a prolonged duration. These findings point toward several areas of improvement for hybrid teams, namely effective communication, cooperation, participation, and feedback between the human and AI members of the team. In addition, the AI used in this work has limited intelligence and natural language processing capability, which may have affected team performance and behavior; future work should therefore examine more skilled and capable AI agents with improved intelligence and verbal communication competency. Overall, the results provide insights into the behaviors of humans interacting with AI teammates and the factors needed to construct truly effective hybrid design teams.
Acknowledgment
This work was supported by the Air Force Office of Scientific Research under Grant No. FA9550-18-1-0088 and the Defense Advanced Research Projects Agency through cooperative agreement No. N66001-17-1-4064. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the sponsors.
The author's affiliation with The MITRE Corporation is provided for identification purposes only and is not intended to convey or imply MITRE's concurrence with, or support for, the positions, opinions, or viewpoints expressed by the author.
Funding Data
Air Force Office of Scientific Research (AFOSR) Grant No. FA9550-18-1-0088.
Defense Advanced Research Projects Agency (DARPA), Cooperative Agreement No. N66001-17-1-4064.
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.