The natural language parameter passed to Midscene becomes part of the prompt sent to the LLM. A few prompt-engineering techniques can help the model understand user interfaces more reliably.
Since AI models are heuristic by nature, the goal of prompt tuning is to obtain stable responses from the model across runs. In most cases, it is entirely feasible to get a consistent response from the LLM with a well-written prompt.
Detailed descriptions and examples are always welcome.
For example:
Bad ❌: "Search 'headphone'"
Good ✅: "Find the search box (it should be along with a region switch, such as 'domestic' or 'international'), type 'headphone', and hit Enter."
Bad ❌: "Assert: food delivery service is in normal state"
Good ✅: "Assert: There is a 'food delivery service' on page, and is in normal state"
The same applies when describing the format of data to extract. LLMs cannot reliably report exact numeric values such as coordinates or hex color codes, so offer a set of choices instead:
Good ✅: "string, color of text, one of blue / red / yellow / green / white / black / others"
Bad ❌: "string, hex value of text color"
Bad ❌: "[number, number], the [x, y] coords of the main button"
Use the visualization tool to debug and understand each step Midscene takes: upload the run log and inspect the AI's parsing results. You can find the tool in the navigation bar of this site.
The LLM may occasionally behave incorrectly, so it is better practice to verify its results after each run.
For example, you can check the list content of the to-do app after inserting a record.
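A post-action check might look like this (a hedged sketch: the to-do app and its prompt wording are hypothetical, while `aiAction` and `aiQuery` are Midscene's documented APIs):

```typescript
import assert from "node:assert";

// Perform the action...
await agent.aiAction("type 'buy milk' in the task input box and hit Enter");

// ...then verify its result instead of trusting the LLM blindly.
const todoTitles: string[] = await agent.aiQuery(
  "string[], the title of each item in the to-do list",
);
assert(todoTitles.includes("buy milk"), "the new record was not inserted");
```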
The only data sent to the LLM are screenshots and element coordinates; the DOM is almost invisible to it. So do not expect the LLM to infer any information from the DOM (such as `test-id-*` attributes).
Ensure everything you expect from the LLM is visible in the screenshot.
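As a concrete contrast (the prompt wording is illustrative):

```typescript
// ❌ Relies on a DOM attribute the model never sees in the screenshot
await agent.aiAction("click the element with test-id 'submit-btn'");

// ✅ Describes what is actually visible on screen
await agent.aiAction("click the blue 'Submit' button below the form");
```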
Since most AI models can understand many languages, feel free to write the prompt in any language you prefer. It usually works even if the prompt is in a language different from the page's language.
Good ✅: "点击顶部左侧导航栏中的“首页”链接"