17 changes: 17 additions & 0 deletions .cursor/rules/advanced-best-practices.mdc
@@ -0,0 +1,17 @@
---
description: Applies general Python coding best practices across all Python files in the project, focusing on code clarity, style, and maintainability.
globs: **/*.py
---
- Always use guard clauses and fail fast.
- Always use type hints for code clarity and type checking, and follow mypy best practices.
- Keep code style consistent using Ruff.
- Always use python3.12+ syntax, especially the `type` syntax, the generic syntax, paramspec (`**P`) syntax, etc.
- Clearly separate between behavioral classes and data classes.
- Use pydantic.BaseModel for data classes that need validation and serialization.
- Use dataclasses.dataclass or pydantic.dataclasses.dataclass for simpler data classes.
- Use the Receive an Object, Return an Object (RORO) pattern.
- Name intermediate dictionary variables `mp_<keytype>_<valuetype>`.
- Prioritize OOP over functional programming.
- Prefer following the practices in current source code, if it is already the best practice.
- `datetime.datetime.now` should always be called with the `tz` argument; use UTC when no other timezone is specified (DTZ005).
- Boolean-typed parameters must be keyword-only in function definitions (FBT001).
7 changes: 7 additions & 0 deletions .cursor/rules/documentation.mdc
@@ -0,0 +1,7 @@
---
description: Applies documentation best practices across Python files and the README, focusing on docstrings and keeping the README up to date.
globs: **/*.py, README.md
---
- Use docstrings to document functions and classes.
- ALWAYS KEEP DOCSTRINGS MAX LINE LENGTH TO 79.
- Update README correspondingly if new features are introduced.
12 changes: 12 additions & 0 deletions .cursor/rules/general-best-practices.mdc
@@ -0,0 +1,12 @@
---
description: Applies general Python coding best practices across all Python files in the project, focusing on code clarity, style, and maintainability.
globs: **/*.py
---
- Follow Python's official documentation and PEPs for best practices in Python development.
- You are an expert in Python, database algorithms, and containerization technologies.
- Write simple and clear code; avoid unnecessary complexity.
- Prefer list comprehensions for creating lists when appropriate.
- Use try-except blocks to handle exceptions gracefully.
- Limit the use of global variables to reduce side effects.
- ALWAYS KEEP MAX LINE LENGTH TO 79.
- Prefer `pathlib` to `os.path`.
6 changes: 6 additions & 0 deletions .cursor/rules/modular-design.mdc
@@ -0,0 +1,6 @@
---
description: Promotes modular design with distinct files for models, services, controllers, and utilities.
globs: *
---
- Modular design with distinct files for models, services, controllers, and utilities.
- For separate modules, handle errors with a dedicated set of module-specific exception classes, one per failure case.
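
One way the module-specific exception rule could be sketched is shown below. The storage module, its exception names, and `load_document` are all hypothetical and serve only as an illustration.

```python
class StorageError(Exception):
    """Base class for errors raised by a hypothetical storage module."""


class DocumentNotFoundError(StorageError):
    """Raised when a document id is not present in the store."""

    def __init__(self, doc_id: str) -> None:
        super().__init__(f"Document not found: {doc_id!r}")
        self.doc_id = doc_id


class EmptyDocumentError(StorageError):
    """Raised when a stored document has no content."""

    def __init__(self, doc_id: str) -> None:
        super().__init__(f"Document is empty: {doc_id!r}")
        self.doc_id = doc_id


def load_document(store: dict[str, str], doc_id: str) -> str:
    """Return the document text, raising a fine-grained error otherwise."""
    if doc_id not in store:
        raise DocumentNotFoundError(doc_id)
    if not store[doc_id]:
        raise EmptyDocumentError(doc_id)
    return store[doc_id]
```

Callers can catch `StorageError` to handle any failure from the module, or a specific subclass when the cases need different treatment.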
9 changes: 9 additions & 0 deletions .cursor/rules/solid-design.mdc
@@ -0,0 +1,9 @@
---
description: Enforces SOLID principles and composition-over-inheritance in Python class design.
globs: **/*.py
---
- Follow SOLID principles when designing classes and modules.
- Define interfaces with `typing.Protocol` or `abc.ABC` when multiple implementations or dependency injection is needed.
- Avoid deep inheritance hierarchies (more than two levels); they are hard to read and maintain.
- Prefer composition over inheritance; inject behavior via attributes or protocols instead of subclassing.
- Reserve inheritance for genuine "is-a" relationships; use composition for "has-a" or shared behavior.
20 changes: 20 additions & 0 deletions .cursor/rules/unit-testing.mdc
@@ -0,0 +1,20 @@
---
description: Enforces the implementation of unit tests to guarantee code reliability and maintainability, especially within the 'tests' directory.
globs: **/tests/**/*.*
---
- Implement unit tests to ensure code reliability.
- Parametrize tests using pytest.mark.parametrize to handle as many cases as possible.
- Follow all Ruff's standards when writing unit tests.
- Always use a tuple for the argument-name list of pytest.mark.parametrize.
- Use pytest.mark.asyncio for async functions.
- Use fixtures to mock third party dependencies, autospec whenever you can.
- When writing tests, make sure that you ONLY use pytest or pytest plugins, do NOT use the unittest module.
- All tests should have typing annotations as well.
- All tests should be in ./tests. Be sure to create all necessary files and folders. If you are creating files inside of ./tests or ./src/goob_ai, be sure to add an `__init__.py` file if one does not exist.
- All tests should be fully annotated and should contain docstrings.
- Be sure to import the following if TYPE_CHECKING:
from _pytest.capture import CaptureFixture
from _pytest.fixtures import FixtureRequest
from _pytest.logging import LogCaptureFixture
from _pytest.monkeypatch import MonkeyPatch
from pytest_mock.plugin import MockerFixture
173 changes: 173 additions & 0 deletions .cursor/skills/creating-pluggable-modules/SKILL.md
@@ -0,0 +1,173 @@
---
name: creating-pluggable-modules
description: Guides creation of new pluggable kotaemon modules (LLM, embedding, reranking, vector store) using dataclass implementations, vendor enums, and factory registries. Use when adding a new provider class, creating factory.py, refactoring away from theflow Param/Node/BaseComponent, or mirroring the llms/embeddings module layout.
---

# Creating Pluggable Modules

Follow the LLM and embedding modules as the reference layout when adding a
new pluggable component type to `libs/kotaemon/`.

Reference implementations:

- `kotaemon/llms/chats/` — LLM vendors
- `kotaemon/embeddings/` — embedding vendors
- `kotaemon/rerankings/` — reranking vendors (same pattern)

## Module layout

```
kotaemon/<domain>/
├── base.py      # abstract base + shared behaviour
├── factory.py   # Vendor enum + MP_VENDOR_CLS + Factory
├── openai.py    # one file per vendor (or group)
└── ...
```

## Checklist — new pluggable type

```
- [ ] 1. Define base class
- [ ] 2. Implement vendor classes with typed fields
- [ ] 3. Create factory.py (enum + registry + Factory)
```

## 1. Base class

Use `@dataclass(kw_only=True)`. Stay with native Python: do not import from third-party packages such as theflow (`Param`, `Node`, ...).

```python
from dataclasses import dataclass

from kotaemon.base.describe import DataclassDescribe, describe_dataclass


@dataclass(kw_only=True)
class BaseEmbeddings:
    @classmethod
    def describe(cls) -> DataclassDescribe:
        return describe_dataclass(cls)

    def run(self, text: str) -> list[float]:
        raise NotImplementedError
```

- Shared logic across unrelated bases → extract to a module-level function
(see `kotaemon/base/describe.py`).

## 2. Vendor implementation

Each vendor is a plain `@dataclass` subclass. Fields replace `Param`:

```python
from dataclasses import dataclass, field

from .base import BaseEmbeddings


@dataclass(kw_only=True)
class OpenAIEmbeddings(BaseEmbeddings):
    api_key: str = field(metadata={"description": "API key"})
    model: str = field(
        default="text-embedding-3-large",
        metadata={"description": "Model name"},
    )
    timeout: float | None = field(
        default=None,
        metadata={"description": "Request timeout"},
    )
```

Rules:

- `field(metadata={"description": ...})` feeds UI via `describe_dataclass`
- Dependencies injected via constructor — not `Node(...)`
- Lazy sub-components → `@cached_property`, not `@Node.auto`
- Object construction → `functools.partial` or `make_*()` factory, not `.withx()`

## 3. Factory registry

Every pluggable set gets a `factory.py`:

```python
from enum import Enum

from .base import BaseEmbeddings
from .openai import OpenAIEmbeddings


class EmbeddingVendor(str, Enum):
    OPENAI = "OpenAIEmbeddings"  # value = class __qualname__


MP_VENDOR_CLS: dict[EmbeddingVendor, type[BaseEmbeddings]] = {
    EmbeddingVendor.OPENAI: OpenAIEmbeddings,
}


class EmbeddingFactory:
    @staticmethod
    def get_cls(vendor: EmbeddingVendor | str) -> type[BaseEmbeddings]:
        key = EmbeddingVendor(vendor)
        if key not in MP_VENDOR_CLS:
            raise ValueError(f"Invalid embedding vendor: {vendor!r}")
        return MP_VENDOR_CLS[key]

    @staticmethod
    def supported_vendors() -> list[EmbeddingVendor]:
        return list(MP_VENDOR_CLS.keys())
```

Naming conventions:

| Item | Convention |
| ---------- | -------------------------------------------------------- |
| Enum | `class FooVendor(str, Enum)` |
| Enum value | class `__qualname__` (e.g. `"OpenAIEmbeddings"`) |
| Registry | `MP_VENDOR_CLS: dict[FooVendor, type[BaseFoo]]` — public |
| Factory | `get_cls(vendor)` + `supported_vendors()` |

`get_cls` must accept `VendorEnum | str` so legacy DB rows keep working.

## 4. Adding a new vendor to an existing type

```
- [ ] 1. Implement the class in its own file
- [ ] 2. Add enum member: NEW_VENDOR = "NewClassName"
- [ ] 3. Register in MP_VENDOR_CLS
- [ ] 4. Export if needed from package __init__.py
- [ ] 5. UI dropdown picks it up via Factory.supported_vendors()
```

No `deserialize`, `import_dotted_string`, or `__type__` in spec dicts.

## 5. Typed return shapes

Use `TypedDict` for fixed-key dicts returned from public functions:

```python
from typing import TypedDict


class DataclassDescribe(TypedDict):
    type: str
    params: dict[str, DataclassParamDesc]  # defined elsewhere
```

Avoid `dict[str, Any]` on public APIs when the key schema is known.

## Anti-patterns

| Don't | Do instead |
| ---------------------------- | ---------------------------------------------------- |
| `Param(help=...)` | `@dataclass` field + `metadata={"description": ...}` |
| `Node(...)` | constructor argument |
| `@Node.auto(...)` | `@cached_property` |
| `BaseComponent` / `Function` | plain class + narrow Protocol |
| `deserialize(spec)` | `Factory.get_cls(vendor)(**spec)` |
| `serialize(obj)` | plain kwargs dict; type in `vendor` column |
| `.withx()` | `partial()` or named factory |
| duplicate classmethod bodies | module-level function |

## Verify

After changes, confirm the app starts and the pool loads:

```bash
python app.py
```