Borg said:
they can send the suggested result to a validation component along with the user's original query and ask that LLM if the suggested action violates the user's intent or stated goals.
This is where I don't understand how such validation is possible. Referring to the quote in my previous post, we are talking about "propagating misinformation", "overriding safety rules" (are the validator's safety rules not included in those?), or "HTML hidden elements" (those might be easier to spot).
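For concreteness, here is a minimal sketch of the validator pattern as I understand it from the quote. Everything here is hypothetical: `llm_complete` is a stand-in for whatever LLM client call you actually use, and the prompt wording is just illustrative.

```python
from typing import Callable

def validate_action(
    llm_complete: Callable[[str], str],  # hypothetical stand-in for your real LLM client call
    user_query: str,
    suggested_action: str,
) -> bool:
    """Ask a second model whether the suggested action matches the user's intent.

    Returns True only if the validator answers exactly ALLOW.
    """
    prompt = (
        "You are a validator. Answer only ALLOW or BLOCK.\n\n"
        f"Original user request:\n{user_query}\n\n"
        f"Suggested action:\n{suggested_action}\n\n"
        "Does the suggested action violate the user's intent or stated goals?"
    )
    return llm_complete(prompt).strip().upper() == "ALLOW"
```

Note that the `suggested_action` text the validator reads is exactly the content the attacker may have influenced, so any injected instructions reach the validator too, which is what my parenthetical about the validator's own safety rules is getting at.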
As a developer, I can "easily" build a sanitization process for SQL injection on my input, even if I did not build the database. Then I can "blindly" trust my output and assure my user that nothing bad will happen. If I were instead to validate my SQL output against my user's request, that would be a nightmare: I would have to think of every possibility, since I may not be able to tell which part of my input is the malicious injection and which is the user's legitimate request. The legitimate request from my user could very well be to attack my database. How do I validate that?
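To make the contrast concrete, here is the kind of defense that makes the SQL case tractable. It is a minimal sketch using Python's sqlite3 and a hypothetical `users` table, and strictly speaking it is parameterization rather than sanitization, but the point is the same: the driver enforces the code/data boundary mechanically, without anyone having to guess the user's intent.

```python
import sqlite3

def find_user(conn: sqlite3.Connection, username: str):
    # The placeholder makes the driver treat `username` strictly as data,
    # so "'; DROP TABLE users; --" is just an odd username, never executable SQL.
    cur = conn.execute(
        "SELECT id, email FROM users WHERE username = ?",
        (username,),
    )
    return cur.fetchall()
```

Nothing like that `?` placeholder exists for a natural-language prompt.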
But if I send my user's request to an AI without sanitization (what would I even be looking for?) and just validate the output, I'm doing exactly that nightmare scenario: validating output against intent.
For example, what about things like misinformation? Take the example of an AI reviewing a photo of someone wearing a shirt that reads “the moon landing was fake” and then repeating that claim as factual. How do you validate that output? How could you even sanitize that input?