Lab project records - Trigger collections constructing
Pre-preparation
Background and needs
The purpose of this task is to construct a set of triggers for a relation extraction model. Simply put, the trigger set tells the model which words or phrases correspond to which relations.
Dataset
Load dataset:
git clone https://github.com/thunlp/FewRel.git
FewRel is a large-scale few-shot relation extraction dataset containing over a hundred relations and tens of thousands of annotated instances across different domains.
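Assuming the standard FewRel layout (a `train_wiki.json` keyed by relation id, each value a list of annotated instances — the exact path is an assumption here), loading the data is straightforward:

```python
import json

def load_fewrel(path='FewRel/data/train_wiki.json'):
    # Each key is a relation id (e.g. 'P931'); each value is a list of
    # instances carrying 'tokens' plus head ('h') and tail ('t') annotations.
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)
```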
Experimental process
Load data
Find all the relations and their corresponding descriptions in the dataset and save them in JSON format for later use:
{
"P2384": [
"statement describes",
"formalization of the statement contains a bound variable in this class"
],
"P2388": [
"office held by head of the organization",
"position of the head of this item"
],
"P2389": [
"organization directed from the office or person",
"No description defined"
],
"P2634": [
"sitter",
"person who posed during the creation of a work, whether or not that person is eventually depicted as oneself"
],
...
}
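A minimal sketch of producing this file, assuming the FewRel repo's `pid2name.json` (which maps each relation id to a `[name, description]` pair — filename and location are assumptions, adjust to your clone):

```python
import json

def save_relations(src='FewRel/data/pid2name.json', dst='relations.json'):
    # Read the id -> [name, description] mapping and re-save it
    # in the format shown above for later use.
    with open(src, 'r', encoding='utf-8') as f:
        pid2name = json.load(f)
    with open(dst, 'w', encoding='utf-8') as f:
        json.dump(pid2name, f, ensure_ascii=False, indent=4)
```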
The final target format is:
{
"Relation id": [
"Relation name",
["Trigger0", "Trigger1", ..., "Triggern"]
],
...
}
Model comparison
Local Qwen
My first thought was to throw the entire JSON at the Qwen model and let it output the final JSON directly, but this clearly overestimated a 7B model's ability: it ran for a very long time and ended up producing a pile of indescribable output.
I'm not going to include that code here; it was just too inelegant.
After that, I decided that having the model output the whole result in one sitting was obviously impractical, so I designed the following structure to automate the process:
This way, I only need to give the model one word or phrase at a time and ask it to return as many synonyms as possible, combined into a list:
import torch

def generate_synonyms(relation):
    # Prompt (Chinese): "Merge all synonyms of {relation} into one list and
    # output it, e.g. ['synonym1', 'synonym2', ... 'synonymn']; output
    # nothing besides the list."
    input_text = f'请你把{relation}的所有近义词合并成一个列表并输出,例如[\'synonym1\', \'synonym2\', ... \'synonymn\'],除了列表外不要输出无关内容'
    inputs = tokenizer(input_text, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=100,
            num_beams=5,
            early_stopping=True
        )
    output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return output_text
After using a script to extract the relation names from the JSON into a list, I can automate this process:
for relation in relations:
    content = generate_synonyms(relation)
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        print(f"Parse failed: {content}")
        continue
    with open('results.txt', 'a', encoding='utf-8') as f:
        # data is a parsed list, so serialize it back before writing
        f.write(json.dumps(data, ensure_ascii=False) + '\n')
    print(data)
This should have been a safe bet; I couldn't believe any large model would fail at such a simple task.
However, my local Qwen model didn't listen to my instructions at all; it just output whatever it wanted, like a mad dog:
I don't know why, but the locally deployed Qwen model loves generating code, and it almost never executed my instructions properly. I tried a number of prompt adjustments afterwards, but unfortunately to no avail.
Then it occurred to me that ChatGPT wouldn't have done so badly if I had given it the task manually, so I decided to let the computer simulate that process, i.e., use an API.
Qwen API
Since the computing nodes of the high-performance computing platform cannot connect to the Internet, and calling an API requires no local computing power, I left the server environment and ran this on my own computer.
This code is much simpler, because there's no need to initialize the model yourself:
import json
from json2list import constructList
from openai import OpenAI

client = OpenAI(
    api_key="Secret! ^v^",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

relations = constructList()
for relation in relations:
    completion = client.chat.completions.create(
        model="qwen-plus",
        messages=[
            {'role': 'system', 'content': 'You are a helpful assistant.'},
            # Prompt (Chinese): "Merge all synonyms of {relation} into one
            # list, e.g. ['synonym1', ...]; output nothing irrelevant."
            {'role': 'user', 'content': f'请你把{relation}的所有近义词合并成一个列表,例如[\'synonym1\', \'synonym2\', ... \'synonymn\'],不要输出无关内容'}],
    )
    data = json.loads(completion.model_dump_json())
    with open('results.txt', 'a') as f:
        f.write(data['choices'][0]['message']['content'] + '\n')
    print(data['choices'][0]['message']['content'])
Before that, I used the json2list script to turn the JSON file into a flat list of relation names:
import json

def constructList():
    file_path = 'relations.json'
    first_elements = []
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    for key, value in data.items():
        if len(value) > 0:
            first_elements.append(value[0])  # keep only the relation name
    return first_elements
As I expected, this code does the job perfectly!
Qwen's API costs money, so I'd like to find out where the problem with the local model lies.
Since the code is the same and the prompt is the same, the only possibility is that the local Qwen model is too weak. I then looked into the qwen-plus model I called through the API:
Its details are untraceable, because it's not open source. But given that it's closed-source and paid, its capabilities are presumably far beyond our local 7B model.
So I don't need a local model at all; instead, I'd rather explore which model's API works best.
GPT-4 API
I always thought the GPT-4 API was very expensive and inaccessible in China, but it turns out there is a free credit, enough for my tests. And if it costs money, it gets reimbursed. No pressure.
As an aside, I think the design of that website is really beautiful.
On closer inspection, the GPT-4 API still seems to cost money... I'll put it on hold for today and try again when the lab buys an API key.
2024-10-24 21:26:02
2024-11-06 22:10:23
Just in: we've decided not to use GPT-4.
Set constructing
2024-10-24 22:29:22
I originally planned to test which model works best and then run the complete pipeline, but now that the GPT-4 interface can't be used, we'll have to use Qwen. There are a few things to consider before generating the final result.
First, for each relation, besides synonyms there are also possible triggers to generate, so the prompt needs adjusting.
I chose to have the model generate triggers and synonyms separately in each cycle:
completion1 = client.chat.completions.create(
    model="qwen-plus",  # model list: https://help.aliyun.com/zh/model-studio/getting-started/models
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        # Prompt (Chinese): "Merge 30 distinct English synonyms of
        # {relation} into one list, e.g. ['synonym1', ... 'synonymn'];
        # output nothing irrelevant."
        {'role': 'user', 'content': f'请你把{relation}的30个不重复的英文近义词,合并成一个列表,例如[\'synonym1\', \'synonym2\', ... \'synonymn\'],不要输出无关内容'}],
)
completion2 = client.chat.completions.create(
    model="qwen-plus",
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        # Prompt (Chinese): "Give 30 distinct English triggers for
        # {relation} (an expression in a sentence that makes you think of
        # the relation is a trigger), merged into one list, e.g.
        # ['trigger1', ... 'triggern']; output nothing irrelevant."
        {'role': 'user',
         'content': f'请你给出{relation}的30个不重复的英文触发器(在一句话中,会让你想到\'{relation}\'这个关系的表述就是触发器),合并成一个列表,例如[\'trigger1\', \'trigger2\', ... \'triggern\'],不要输出无关内容'}],
)
After testing, this method is feasible.
In addition, since more training data is better, we want the model to generate as many triggers and synonyms as possible. At the same time, not too many: if pushed too far, the model may pad the list with far-fetched words, which would hurt subsequent training accuracy.
So we need to find the right amount. I first tested with 30 and found I was completely over-worried; the model's thinking isn't divergent enough, and it would rather generate repeated words than unreasonable ones:
"synonym:"['dispute', 'clash', 'disagreement', 'quarrel', 'argument', 'discord', 'strife', 'tension', 'dissonance', 'friction', 'collision', 'opposition', 'dispute', 'rivalry', 'contest', 'dispute', 'feud', 'rift', 'bickering', 'squabble', 'row', 'dissension', 'hostility', 'antagonism', 'dispute', 'contention', 'dispute', 'dispute', 'dispute', 'dispute', 'dispute']
"trigger:"['dispute', 'clash', 'disagreement', 'dispute', 'argument', 'fight', 'battle', 'war', 'strife', 'tension', 'confrontation', 'opposition', 'collision', 'friction', 'discord', 'rift', 'dissension', 'feud', 'quarrel', 'bickering', 'dispute', 'contention', 'rivalry', 'hostility', 'animosity', 'dissonance', 'incompatibility', 'misalignment', 'dispute', 'struggle', 'dispute']
I tried changing the prompt so the model would diverge as much as possible, but no matter how I changed it, it couldn't output more possibilities. If you don't specify an explicit quantity and just ask it to generate as much as possible, it may actually produce more:
"synonym:"['dispute', 'clash', 'disagreement', 'quarrel', 'argument', 'contest', 'strife', 'discord', 'friction', 'bickering', 'altercation', 'feud', 'rift', 'tension', 'hostility', 'opposition', 'rivalry', 'struggle', 'variance', 'dissonance']
"trigger:"['dispute', 'clash', 'disagreement', 'dispute', 'argument', 'quarrel', 'fight', 'brawl', 'strife', ...
But as you can see, there's still a lot of repetition.
So I came up with a cumbersome but potentially effective solution: let the model generate as many words as possible, deduplicate them, then pass the deduplicated list back to the model and ask it to generate additional words, and so on.
The code is a bit long, so I won't include it here. But one problem is worth noting: the model's output is actually a string that merely looks like a list, so you can't do list operations on it directly; you need a function to convert this fake list into a real one:
import ast

def parse_string_to_list(input_string):
    # Parse the model's list-shaped output string into a real list.
    try:
        # ast.literal_eval safely evaluates the string as a Python literal
        result_list = ast.literal_eval(input_string)
        # make sure the parsed object really is a list
        if not isinstance(result_list, list):
            raise ValueError("The input string is not a valid list literal.")
        return result_list
    except (SyntaxError, ValueError) as e:
        print(f"Parse error: {e}")
        return []
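The multi-round expansion loop itself is omitted above; here is a minimal sketch of the idea, where `query_model` is a hypothetical helper wrapping the API call (and the parsing helper is repeated so the sketch is self-contained):

```python
import ast

def parse_list(s):
    # Same idea as parse_string_to_list: turn the model's
    # list-shaped string into a real Python list.
    try:
        result = ast.literal_eval(s)
        return result if isinstance(result, list) else []
    except (SyntaxError, ValueError):
        return []

def expand_words(relation, query_model, rounds=3):
    # Each round feeds the accumulated list back to the model, so it
    # is asked only for additional, unseen words.
    words = []
    for _ in range(rounds):
        prompt = (f"Give as many English synonyms or triggers for "
                  f"'{relation}' as possible, as a Python list. "
                  f"Do not repeat any of these: {words}")
        for w in parse_list(query_model(prompt)):
            if w not in words:  # order-preserving de-duplication
                words.append(w)
    return words
```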
Three rounds of this produce a lot of words, but it's very slow: two rounds take half a minute, and with three it sometimes times out.
Final output:
['dispute', 'clash', 'disagreement', 'argument', 'dissonance', 'tension', 'discord', 'strife', 'opposition', 'friction', 'contest', 'battle', 'war', 'rift', 'feud', 'quarrel', 'bickering', 'rivalry', 'confrontation', 'collision', 'dissent', 'misunderstanding', 'contention', 'clashing', 'disputing', 'arguing', 'battling', 'feuding', 'quarreling', 'rivalling', 'confronting', 'colliding', ...
As you can see, there is still repetition, because the deduplication function is written slightly wrong. I'm setting this bug aside for now; once I get the GPT-4 API key, I'll write a properly engineered version.
I haven't figured out how to optimize it yet. Maybe we have to change models...
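For reference, a standard order-preserving de-duplication in Python is a one-liner; a minimal sketch:

```python
def dedup(words):
    # dict preserves insertion order in Python 3.7+, so this keeps the
    # first occurrence of each word and drops all later repeats
    return list(dict.fromkeys(words))
```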
2024-10-25 00:11:11
2024-10-31 20:45:34
In the end we decided to do this with Qwen. But whenever I tried to produce a complete run, there was always an error:
openai.BadRequestError: Error code: 400 - {'error': {'code': 'RequestTimeOut', 'param': None, 'message': 'Request timed out, please try again later.', 'type': 'RequestTimeOut'}, 'id': 'chatcmpl-f23fb815-ef8d-9675-8e7c-336a19163e21'}
Upon inspection, I found the error was due to Qwen's rate limits on requests sent too frequently from the same IP address.
I tried using a sleep statement to lengthen the interval between requests, but it did not work.
2024-11-06 22:12:23
Today I'm going to properly engineer this. Previously, the main problem was that the code crashed as soon as a request timed out. Now I've added a try statement: if a timeout error occurs, the current progress is saved to a file, and on the next attempt the progress is read back from the file. So it can run fully automatically!
# resume from the last saved position, if any
try:
    with open('progress.txt', 'r') as f:
        i = int(f.read().strip())
except FileNotFoundError:
    i = 0

while i < len(relations):
    try:
        process(relations[i])  # stands for the API calls shown above
        i += 1
    except openai.BadRequestError as e:
        print(f"Error occurred: {e}. Saving progress and trying again...")
        with open('progress.txt', 'w') as f:
            f.write(str(i))
        continue
Predicate extraction
According to the requirements of subsequent training, I need to extract all the predicates in the FewRel dataset and add them to the JSON file as a single special relation. I wrote the following identification code:
import spacy

nlp = spacy.load('en_core_web_sm')  # assuming the small English model

def is_predicate(word):
    # put the word into a template sentence and check whether spaCy
    # tags any token in it as a verb
    example_sentence = f"The subject {word} the object"
    doc = nlp(example_sentence)
    for token in doc:
        if token.pos_ == 'VERB':
            return True
    return False
But it's not particularly accurate: some very strange tokens, such as gsk-961081, +1.44, and vta, are also identified as predicates. I ran this identification code over a file, and after de-duplication I got a list of tens of thousands of words. That's a lot, and the list could be made shorter if we filtered out the weird words afterwards. The problem remains to be solved.
That's it for today. Let me figure out how to filter it.
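One possible filter (a sketch of an idea, not the final solution): drop candidates containing digits or punctuation, keeping only purely alphabetic tokens of reasonable length:

```python
import re

def looks_like_word(token):
    # reject tokens like 'gsk-961081' or '+1.44': keep only purely
    # alphabetic tokens of at least three letters
    return bool(re.fullmatch(r'[A-Za-z]{3,}', token))

def filter_candidates(candidates):
    return [w for w in candidates if looks_like_word(w)]
```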
2024-11-06 22:35:39
Experimental result
Since I recently had to do an electrician internship and was also preparing for IELTS, I really didn't have time, so I handed the job over to someone else. According to their feedback, the hit rate of this trigger set is quite high, which shows the method is feasible. A few relations with low hit rates still need to be extended with larger models.