Self-Operating Computer
Using the same inputs and outputs as a human operator, this framework enables multimodal AI models to view the screen and decide on a series of mouse and keyboard actions to reach an objective.
Integration
Currently integrated with GPT-4-Vision as the default model.
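To make the observe-decide-act loop concrete, here is a minimal sketch of a single step. The gpt-4-vision-preview model name, the helper names, and the plain-text CLICK/TYPE/DONE action format are illustrative assumptions; the framework's actual prompts and parsing differ.

```python
import base64
import io

import pyautogui
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def screenshot_as_data_url() -> str:
    """Capture the screen and encode it as a base64 PNG data URL."""
    image = pyautogui.screenshot()
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()


def decide_next_action(objective: str) -> str:
    """Show the model the current screen and ask for a single next action."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name for illustration
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Objective: {objective}. "
                         "Reply with exactly one action: "
                         "CLICK x,y | TYPE text | DONE."},
                {"type": "image_url",
                 "image_url": {"url": screenshot_as_data_url()}},
            ],
        }],
    )
    return response.choices[0].message.content


def execute(action: str) -> bool:
    """Carry out the model's reply; returns True once the objective is done."""
    if action.startswith("CLICK"):
        x, y = (int(v) for v in action.split()[1].split(","))
        pyautogui.click(x, y)
    elif action.startswith("TYPE"):
        pyautogui.write(action.split(" ", 1)[1])
    return action.startswith("DONE")
```

In practice, such a step would be repeated until the model reports that the objective is complete.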
Compatibility
Designed to work across operating systems and to support various multimodal models.
Future Plans
At HyperwriteAI, we are developing Agent-1-Vision, a multimodal model designed for operating software and computer interfaces, with more accurate click location predictions.
Agent-1-Vision Model API Access
We will soon be offering API access to our Agent-1-Vision model. If you're interested in gaining access to this API, sign up here:
Additional Thoughts
We recognize that some operating system functions may be executed more efficiently with hotkeys, such as focusing the browser address bar with Command + L, rather than by simulating a mouse click at the correct XY location.
We plan to make these improvements over time. However, it's important to note that many actions require the accurate selection of visual elements on the screen, necessitating precise XY mouse click locations.
A primary focus of this project is to refine the accuracy of determining these click locations. We believe this is essential for achieving a fully self-operating computer in the current technological landscape.
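For illustration, the two action styles compare like this using the pyautogui library; the coordinates below are placeholders, not real model predictions, and the OS check is a simplifying assumption.

```python
import platform

import pyautogui

# Hotkey style: focus the browser address bar directly.
# The modifier differs by OS: Command on macOS, Ctrl elsewhere.
modifier = "command" if platform.system() == "Darwin" else "ctrl"
pyautogui.hotkey(modifier, "l")

# XY-click style: requires an accurate location prediction first.
pyautogui.click(x=640, y=52)  # placeholder coordinates
```

The hotkey needs no visual grounding at all, which is why it can be more reliable than a predicted click for functions the OS exposes directly.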
Join the Discussion and Contribute on GitHub
We encourage contributions and discussion via the Self-Operating Computer GitHub page.
Our team is unable to provide custom support at this time.