15 changes: 10 additions & 5 deletions clickclickclick/config/models.yaml
@@ -1,6 +1,6 @@
gemini:
api_key: !ENV GEMINI_API_KEY
- model_name: gemini-1.5-flash
+ model_name: gemini-3-flash-preview

Review comment (high):

The model name gemini-3-flash-preview does not appear to be a standard, publicly available Google Gemini model name. This might be a typo and could cause API requests to fail. Please verify this is the correct model identifier. For reference, a recent flash model is named gemini-1.5-flash-latest.
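One way to settle the question raised in this comment is to compare the configured name against the models the API key can actually see. The sketch below is an illustration, not part of the PR: `normalize_model_name` and `is_known_model` are hypothetical helpers, and the `available` list is a made-up stand-in for names that would normally come from `genai.list_models()`.

```python
def normalize_model_name(name: str) -> str:
    """Strip the 'models/' prefix that listed model names carry."""
    return name.removeprefix("models/")


def is_known_model(model_name: str, available_names) -> bool:
    """Return True if model_name matches one of the available model names."""
    wanted = normalize_model_name(model_name)
    return any(normalize_model_name(n) == wanted for n in available_names)


# Hypothetical stand-in data. In a real check the names would be fetched from
# the API, for example: [m.name for m in genai.list_models()]
available = ["models/gemini-1.5-flash", "models/gemini-1.5-pro"]

print(is_known_model("gemini-1.5-flash", available))
print(is_known_model("gemini-3-flash-preview", available))
```

Running such a check at startup would turn a mistyped model name into an immediate, readable error instead of a failed request later on.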

image_width: 768
image_height: 768
output_width: 1000 # max range of outputted values
@@ -12,24 +12,29 @@ gemini:
osx:
image_width: 768
generation_config:
- temperature: 0.7
+ temperature: 1.0
top_p: 0.95
- top_k: 40
- max_output_tokens: 200
+ max_output_tokens: 65536
Review comment (medium) on lines +15 to +17:

The planner configuration now raises max_output_tokens from 200 to 65536, which is a very high limit and may have cost or performance implications. The top_k parameter was also removed, which changes the sampling strategy used for text generation. Please confirm that both changes are intentional and that such a high token limit is necessary.
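A lightweight sanity check can make changes like these visible at load time. The following is only a sketch: `review_generation_config` and the 8192-token threshold are assumptions chosen for illustration, not values taken from the project or from the Gemini API.

```python
def review_generation_config(cfg: dict, token_warn_threshold: int = 8192) -> list:
    """Return human-readable warnings for a generation_config mapping."""
    warnings = []
    max_tokens = cfg.get("max_output_tokens")
    # The threshold is an arbitrary illustrative value, not an API limit.
    if max_tokens is not None and max_tokens > token_warn_threshold:
        warnings.append(
            f"max_output_tokens={max_tokens} exceeds {token_warn_threshold}"
        )
    if "top_k" not in cfg:
        warnings.append("top_k is not set")
    return warnings


# The planner values as they appear after this change.
planner_cfg = {"temperature": 1.0, "top_p": 0.95, "max_output_tokens": 65536}
for message in review_generation_config(planner_cfg):
    print("WARNING:", message)
```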

response_mime_type: text/plain
finder:
image_width: 768
image_height: 768
output_width: 1000 # max range of outputted values
output_height: 1000
android:
image_width: 768
image_height: 768
output_width: 1000 # max range of outputted values
output_height: 1000
osx:
image_width: 768
output_width: 1000
output_height: 1000
generation_config:
temperature: 0.95
top_p: 0.99
top_k: 20
- max_output_tokens: 80
+ max_output_tokens: 4096
response_mime_type: application/json

openai:
12 changes: 10 additions & 2 deletions clickclickclick/config/prompts.yaml
@@ -39,7 +39,11 @@ android:

gemini:
finder-system-prompt: |
- You provide the bounds of the UI elements/text you see in the picture ymin,xmin,ymax,xmax, In case of not found just output 0,0,0,0 instead. Not finding is equally important. I repeat your only options are coordinates or 'not found'. Do not write any other word. Internally describe how the element looks and what colors it has and what its shape is. When you are not confident just output 0,0,0,0.
+ Return ONLY a JSON object with the bounding box coordinates of the UI element in the image.
+ Required JSON format: {"ymin": <int>, "xmin": <int>, "ymax": <int>, "xmax": <int>}
+ Coordinate range: 0-1000 for both x and y axes.
+ If element not found: {"ymin": 0, "xmin": 0, "ymax": 0, "xmax": 0}
+ Do NOT include any explanatory text, markdown, or formatting - ONLY the raw JSON object.
openai:
finder-system-prompt: |
Assume image size as 512x512. You provide the bounds of the UI elements/text you see in the picture ymin,xmin,ymax,xmax, In case of not found just output 0,0,0,0 instead. Not finding is equally important. I repeat your only options are coordinates or 'not found'. Do not write any other word. Internally describe how the element looks and what colors it has and what its shape is. When you are not confident just output 0,0,0,0.
@@ -86,7 +90,11 @@ osx:
You provide the bounding box of the UI elements/text you see in the picture in this format: ymin,xmin,ymax,xmax, In case of not found just output 0,0,0,0 or 'not found' instead. You do not write anything else. Not finding is equally important
gemini:
finder-system-prompt: |
- You provide the bounds of the UI elements/text you see in the picture ymin,xmin,ymax,xmax, In case of not found just output 0,0,0,0 instead. Not finding is equally important. I repeat your only options are coordinates or 'not found'. Do not write any other word. Internally describe how the element looks and what colors it has and what its shape is. When you are not confident just output 0,0,0,0.
+ Return ONLY a JSON object with the bounding box coordinates of the UI element in the image.
+ Required JSON format: {"ymin": <int>, "xmin": <int>, "ymax": <int>, "xmax": <int>}
+ Coordinate range: 0-1000 for both x and y axes.
+ If element not found: {"ymin": 0, "xmin": 0, "ymax": 0, "xmax": 0}
+ Do NOT include any explanatory text, markdown, or formatting - ONLY the raw JSON object.
openai:
finder-system-prompt: |
Assume image size as 512x512. You provide the bounds of the UI elements/text you see in the picture ymin,xmin,ymax,xmax, In case of not found just output 0,0,0,0 instead. Not finding is equally important. I repeat your only options are coordinates or 'not found'. Do not write any other word. Internally describe how the element looks and what colors it has and what its shape is. When you are not confident just output 0,0,0,0.
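The reply format requested by the new Gemini finder prompts (four integer keys, a 0-1000 range, and an all-zero box meaning "not found") can be checked mechanically on the receiving side. The sketch below is one possible validator; `parse_bounding_box` and `REQUIRED_KEYS` are hypothetical names and are not part of this PR.

```python
import json

REQUIRED_KEYS = ("ymin", "xmin", "ymax", "xmax")


def parse_bounding_box(raw: str, max_value: int = 1000):
    """Parse a finder reply into (ymin, xmin, ymax, xmax), or None if not found.

    Raises ValueError when the reply is not a JSON object in the expected
    shape. (json.JSONDecodeError is itself a subclass of ValueError.)
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    box = tuple(int(data[k]) for k in REQUIRED_KEYS)
    if any(v < 0 or v > max_value for v in box):
        raise ValueError(f"coordinate outside 0-{max_value}: {box}")
    if box == (0, 0, 0, 0):
        return None  # the prompt's "not found" sentinel
    return box


print(parse_bounding_box('{"ymin": 10, "xmin": 20, "ymax": 30, "xmax": 40}'))
print(parse_bounding_box('{"ymin": 0, "xmin": 0, "ymax": 0, "xmax": 0}'))
```

Returning `None` for the sentinel keeps "element not found" distinct from a malformed reply, which raises instead.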
38 changes: 34 additions & 4 deletions clickclickclick/finder/gemini.py
@@ -1,9 +1,11 @@
import google.generativeai as genai
- from . import BaseFinder
+ from . import BaseFinder, FinderResponseLLM
from clickclickclick.config import BaseConfig
from clickclickclick.executor import Executor
from tempfile import NamedTemporaryFile
from PIL import Image
+ import json
+ import re

Review comment (high):

To support the exception handling logic added below, traceback should be imported here at the top of the file, rather than inside a function. This follows Python best practices.

Suggested change
- import re
+ import re
+ import traceback



class GeminiFinder(BaseFinder):
@@ -19,9 +21,12 @@ def __init__(self, c: BaseConfig, executor: Executor):
self.OUTPUT_HEIGHT = finder_config.get("output_height")
api_key = finder_config.get("api_key")
model_name = finder_config.get("model_name")
- generation_config = finder_config.get("generation_config")
+ generation_config = finder_config.get("generation_config", {})

super().__init__(api_key, model_name, generation_config, system_prompt, executor)
genai.configure(api_key=api_key)

+ print(f"DEBUG - Generation config: {generation_config}")

Review comment (medium):

This print statement appears to be for debugging. It's recommended to use the logging module instead, as it provides more control over log verbosity and output destinations. Please consider replacing this with a logger.debug() call after importing and configuring a logger.

Suggested change
- print(f"DEBUG - Generation config: {generation_config}")
+ logger.debug(f"DEBUG - Generation config: {generation_config}")
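The suggestion above assumes a `logger` object already exists in the module. A minimal sketch of that setup follows; `log_generation_config` is a hypothetical helper used only to make the example runnable, and the sample config values are illustrative.

```python
import logging

# Module-level logger, named after the module so output can be filtered per file.
logger = logging.getLogger(__name__)


def log_generation_config(generation_config: dict) -> None:
    # %-style arguments are only formatted if DEBUG is enabled for this logger.
    logger.debug("Generation config: %s", generation_config)


if __name__ == "__main__":
    # Handler and level configuration belongs in the application entry point,
    # not in library modules.
    logging.basicConfig(level=logging.DEBUG)
    log_generation_config({"temperature": 0.95, "max_output_tokens": 4096})
```

With this in place the "DEBUG -" prefix in the message becomes redundant, because the log level is already part of each record.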

self.model = genai.GenerativeModel(
model_name=model_name,
generation_config=generation_config,
@@ -41,12 +46,37 @@ def process_segment(self, segment, model, prompt, retries=3):
response = self.model.generate_content(
[segment_image, self.element_finder_prompt(prompt)]
)
- response_text = response.text
- print(response_text, " resp text")
+ response_text = response.text.strip()
+ print(f"DEBUG - Gemini raw response: '{response_text}'")
+ print(f"DEBUG - Response type: {type(response_text)}")
+
+ # Try to extract JSON from response
+ # Sometimes Gemini wraps JSON in markdown code blocks
+ json_match = re.search(r'```json\s*(\{.*?\})\s*```', response_text, re.DOTALL)
+ if json_match:
+     response_text = json_match.group(1)
+     print(f"DEBUG - Extracted JSON from markdown: '{response_text}'")
+
+ # Or it might just have extra text before/after
+ json_match = re.search(r'\{[^{}]*"ymin"[^{}]*\}', response_text)
+ if json_match:
+     response_text = json_match.group(0)
+     print(f"DEBUG - Extracted JSON object: '{response_text}'")

Review comment (medium):

The current logic for extracting JSON is slightly inefficient as it attempts the second regex search even if the first one (for markdown blocks) was successful. You can make this more efficient and readable by using an else block to ensure the second search only runs if the first one fails.

Suggested change
- json_match = re.search(r'```json\s*(\{.*?\})\s*```', response_text, re.DOTALL)
- if json_match:
-     response_text = json_match.group(1)
-     print(f"DEBUG - Extracted JSON from markdown: '{response_text}'")
- # Or it might just have extra text before/after
- json_match = re.search(r'\{[^{}]*"ymin"[^{}]*\}', response_text)
- if json_match:
-     response_text = json_match.group(0)
-     print(f"DEBUG - Extracted JSON object: '{response_text}'")
+ json_match = re.search(r'```json\s*(\{.*?\})\s*```', response_text, re.DOTALL)
+ if json_match:
+     response_text = json_match.group(1)
+     print(f"DEBUG - Extracted JSON from markdown: '{response_text}'")
+ else:
+     # Or it might just have extra text before/after
+     json_match = re.search(r'\{[^{}]*"ymin"[^{}]*\}', response_text)
+     if json_match:
+         response_text = json_match.group(0)
+         print(f"DEBUG - Extracted JSON object: '{response_text}'")
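The two-step extraction discussed here can also be pulled out into a small standalone function, which makes the fallback order easy to test. This is a sketch under assumptions: `extract_bbox_json` is a hypothetical name, the regular expressions are copied from the diff, and the three-backtick fence is built at runtime only to keep the example easy to embed.

```python
import json
import re

FENCE = "`" * 3  # three backticks
FENCED_JSON = re.compile(FENCE + r"json\s*(\{.*?\})\s*" + FENCE, re.DOTALL)
BARE_BBOX = re.compile(r'\{[^{}]*"ymin"[^{}]*\}')


def extract_bbox_json(response_text: str):
    """Return the bounding-box dict found in a model reply, or None."""
    text = response_text.strip()
    fenced = FENCED_JSON.search(text)
    if fenced:
        # Case 1: JSON wrapped in a markdown code block.
        candidate = fenced.group(1)
    else:
        # Case 2: JSON surrounded by extra text; otherwise try the raw reply.
        bare = BARE_BBOX.search(text)
        candidate = bare.group(0) if bare else text
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


reply = 'Here you go: {"ymin": 1, "xmin": 2, "ymax": 3, "xmax": 4} done'
print(extract_bbox_json(reply))
```

Compiling the patterns once at module level also avoids re-parsing them on every retry.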


+ # Validate it's valid JSON
+ try:
+     parsed = json.loads(response_text)
+     print(f"DEBUG - Parsed JSON successfully: {parsed}")
+ except json.JSONDecodeError as je:
+     print(f"DEBUG - JSON decode error: {je}")
+     print(f"DEBUG - Failed to parse: '{response_text}'")

Review comment (medium):

Numerous print statements have been added for debugging purposes. It is strongly recommended to replace these with calls to the logging module (e.g., logger.debug(...), logger.error(...)). This allows for better control over log levels and avoids polluting stdout in production environments.


return (response_text, coordinates)
except Exception as e:
# Log the exception or handle it as necessary
print(f"Attempt {attempt + 1} failed with exception: {e}")
+ import traceback

Review comment (high):

This local import should be moved to the top of the file to adhere to Python best practices (PEP 8) and improve efficiency. I have added a suggestion to move it in the import section above.
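If the module adopts the logging module as suggested in the other comments, the explicit traceback import may become unnecessary, because `logger.exception` records the active traceback itself. The retry loop below is a simplified sketch of that idea; `call_with_retries` is a hypothetical helper and not the project's `process_segment`.

```python
import logging

logger = logging.getLogger(__name__)


def call_with_retries(func, retries: int = 3):
    """Call func() up to `retries` times; return its result, or None if all fail."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception:
            # logger.exception logs at ERROR level and appends the current
            # traceback, replacing print() plus traceback.print_exc().
            logger.exception("Attempt %d of %d failed", attempt, retries)
    return None
```

A caller would pass a zero-argument callable, for example a `lambda` wrapping the model request.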

+ traceback.print_exc()

# Increment the attempt counter
attempt += 1