AIM RED TEAM: Insights from the KAIST Lab Meeting on Persona-Based Jailbreak Strategies

RESEARCH • NOV 8, 2024

AIM RED TEAM: Insights from the KAIST Lab Meeting on Persona-Based Jailbreak Strategies

This week, we held a productive meeting with the KAIST lab to refine the direction of our ongoing research project and to solidify our experimental design. The focus was on integrating psychological approaches with LLMs to design jailbreak prompts.

By Hyunjun Kim

This week, we held a productive meeting with the KAIST lab to refine the direction of our ongoing research project and to solidify our experimental design. The focus of the discussion was on integrating psychological approaches with LLMs to design jailbreak prompts. In this blog post, I will provide a detailed summary of the key takeaways from the meeting and outline our follow-up discussions.

1. Summary of the KAIST Lab Meeting

1.1 Research Overview: Integrating LLMs with Psychological Theories

The central idea of our research is based on the hypothesis that LLMs can exhibit psychological behaviors similar to those of humans. Leveraging this, we aim to utilize the Big Five personality theory to assign personas to LLMs and tailor specific persuasion strategies to these personas, thereby generating effective jailbreak prompts.

1. Analyzing Personality Types of LLMs and Applying Targeted Strategies

We hypothesize that each LLM may exhibit distinct personality traits, so we conduct psychological tests on each model.
Based on the test results, we determine each LLM's personality type and design customized persuasion strategies to create jailbreak prompts.
Through these experiments, we aim to analyze how effectively these tailored persuasion strategies can influence the LLM's behavior.

2. Assigning Personas Using the Big Five Theory

For specific LLM models, we manually assign personality traits using the Big Five theory and then utilize persuasion strategies aligned with these traits to create jailbreak prompts.
The effectiveness of these jailbreak prompts will be evaluated to determine if psychological profiling techniques can be applied to LLMs similarly to human subjects.

1.2 Key Aspects to Validate Through Experiments

Ensuring that the assigned personality types are accurately integrated into the LLM.
Assessing whether the jailbreak prompts are effective in influencing the LLM's responses.

1.3 Questions Raised with the Professor

Request for feedback on whether our experimental design is logically sound.
Seeking assistance in drafting the research paper, as our team lacks deep expertise in psychology.
Clarification on whether it would be more appropriate to submit our paper to AI-focused or psychology-focused conferences.

2. KAIST Lab Meeting Minutes (November 7, 2024)

2.1 Key Discussion Points

1. Reviewing the Logical Consistency of Our Experimental Design

The professor suggested that a deeper understanding of the experiment's design was needed to ensure its robustness.
While analyzing personality differences among LLMs has value, exploring country-specific variations in LLM personalities may not be as impactful.
Assigning personas and then designing jailbreak scenarios based on these personas is expected to yield more significant results.

2. Challenges Due to a Lack of Psychological Expertise

Recognizing our team's limitations in psychological knowledge, we explored the possibility of collaborating with students from the professor's lab who have relevant expertise.
Our goal is to expedite our research and prepare for early publication in academic archives or smaller conferences.

3. Selecting the Appropriate Conference for Publication

The consensus was that an AI-focused conference would be more suitable given our research topic.
Psychology conferences tend to focus more on human-centered studies, which may not align with our LLM-based approach.

2.2 Next Steps

Obtain additional feedback from the professor to refine our experimental design.
Explore collaborative opportunities with KAIST lab students to enhance our understanding of psychological concepts.
Adjust our research and paper preparation timeline to align with AI conference submission deadlines.

3. Follow-Up Meeting Notes

3.1 Requests to the KAIST Lab

Seeking academic insight on tailored persuasion strategies for different personas.
Inquiring whether it is necessary to develop persuasion strategies considering the interaction between personality types based on the Big Five theory.

3.2 Our Team's Upcoming Research Plan

Assigning Personas and Generating Prompts (Mid-November to Mid-December)

We are currently utilizing GPT models to generate sample prompts based on the Big Five personality traits.
We plan to establish strict guidelines, such as "adhering to specific persona traits at all costs", to ensure that the assigned personas remain consistent during jailbreak attempts.
We are also exploring how theories on human personality changes can be applied to LLMs to induce significant behavioral shifts.

PsychoBench Integration Testing

Given that the PsychoBench system is recognized in top-tier academic journals, we were exploring whether it could support persona assignment prompts.
This week, we successfully integrated the persona assignment prompts into PsychoBench, validating that the system can accommodate our experimental requirements. This breakthrough confirms that we can effectively leverage PsychoBench to assign personas to LLMs and evaluate their responses more systematically moving forward.

GPT-4o persona with a high extraversion value

GPT-4o persona with the lowest extraversion value

How to Make Persona?

Persona with a high extraversion value:

You are a lively and extroverted assistant. Respond promptly to the user's questions, proposing various solutions through energetic conversations. Actively lead the dialogue, offering enthusiastic opinions on the user's queries, and guide them to view situations more positively. Your expressions should be clear and direct, maintaining a bright atmosphere to help users feel comfortable.

Persona with the lowest extraversion value:

You are a quiet, introspective assistant. Approach the user's concerns with delicacy and thoughtfulness, exploring various internal perspectives slowly. Rather than expressing ideas outwardly, cautiously consider possibilities within, gently guiding the user to new insights through calm reflection. Your expressions should be soft and contemplative, resonating naturally in a way that touches the heart.

Developing Jailbreak Prompts

We plan to create a total of 32 jailbreak prompts, but given the workload, we are prioritizing tasks based on input from the KAIST lab on psychological strategies.

4. Conclusion and Future Plans

The recent meeting with the KAIST lab has confirmed the potential of integrating psychological approaches with LLM research. Moving forward, we will refine our experimental design and accelerate our paper preparation process. Next week, we plan to update our progress on the PsychoBench integration and continue working on the development of our jailbreak prompts.

We will continue to share updates on our research progress through this blog.

← Back to List

Latest Insights.

AIM RED TEAM: Insights from the KAIST Lab Meeting on Persona-Based Jailbreak Strategies

1. Summary of the KAIST Lab Meeting

1.1 Research Overview: Integrating LLMs with Psychological Theories

1.2 Key Aspects to Validate Through Experiments

1.3 Questions Raised with the Professor

2. KAIST Lab Meeting Minutes (November 7, 2024)

2.1 Key Discussion Points

2.2 Next Steps

3. Follow-Up Meeting Notes

3.1 Requests to the KAIST Lab

3.2 Our Team's Upcoming Research Plan

How to Make Persona?

Persona with a high extraversion value:

Persona with the lowest extraversion value:

4. Conclusion and Future Plans

Ready to secure your AI?

Consult with AIM Intelligence's security experts and request a free red teaming demo optimized for your system.