Compare commits


49 Commits

Author SHA1 Message Date
joachim-danswer
013bed3157 fix 2025-06-30 15:19:42 -07:00
joachim-danswer
289f27c43a updates 2025-06-30 15:06:12 -07:00
joachim-danswer
736a9bd332 erase history 2025-06-30 09:01:23 -07:00
joachim-danswer
8bcad415bb nit 2025-06-30 08:16:43 -07:00
joachim-danswer
93e6e4a089 mypy nits 2025-06-30 07:49:55 -07:00
joachim-danswer
ed0062dce0 fix 2025-06-30 02:45:03 -07:00
joachim-danswer
6e8bf3120c hackathon v1 changes 2025-06-30 01:39:36 -07:00
Weves
e480946f8a Reduce frequency of heavy checks on primary for cloud 2025-06-28 17:56:34 -07:00
Evan Lohn
be25b1efbd perm sync validation framework (#4958)
* perm sync validation framework

* frontend fixes

* validate perm sync when getting runner

* attempt to fix integration tests

* added new file

* oops

* skipping salesforce test due to creds

* add todo
2025-06-28 19:57:54 +00:00
Chris Weaver
204493439b Move onyx_list_tenants.py to make sure it's in the image (#4966)
* Move onyx_list_tenants.py to make sure it's in the image

* Improve
2025-06-28 13:18:14 -07:00
Weves
106c685afb Remove CONCURRENTLY from migrations 2025-06-28 11:59:59 -07:00
Raunak Bhagat
809122fec3 fix: Fix bug in which emails would be fetched during initial indexing (#4959)
* Add new convenience method

* Fix bug in which emails would be fetched for initial indexing

* Improve tests for MS Teams connector

* Fix test_gdrive_perm_sync_with_real_data patching

* Protect against incorrect truthiness

---------

Co-authored-by: Weves <chrisweaver101@gmail.com>
2025-06-27 22:05:50 -07:00
Chris Weaver
c8741d8e9c Improve mt migration process (#4960)
* Improve MT migration process

* improve MT migrations

* Improve parallel migration

* Add additional options to env.py

* Improve script

* Remove script

* Simplify

* Address greptile comment

* Fix st migration

* fix run_alembic_migrations
2025-06-27 17:31:22 -07:00
Weves
885f01e6a7 Fix test_gdrive_perm_sync_with_real_data patching 2025-06-27 16:34:37 -07:00
Rei Meguro
3180a13cf1 source fix (#4956) 2025-06-27 13:20:42 -07:00
Rei Meguro
630ac31355 KG vespa error handling + separating relationship transfer & vespa updates (#4954)
* feat: move vespa at end in try block

* simplify query

* mypy

* added order by just in case for consistent pagination

* liveness probe

* kg_p check for both extraction and clustering

* fix: better vespa logging
2025-06-26 22:05:57 -07:00
Chris Weaver
80de62f47d Improve drive group sync (#4952)
* Improve drive group sync

* Improve group syncing approach

* Fix github action

* Improve tests

* address greptile
2025-06-26 20:14:35 -07:00
Raunak Bhagat
c75d42aa99 perf: Improve performance of MS Teams permission-syncing logic (#4953)
* Add function stubs for Teams

* Implement more boilerplate code

* Change structure of helper functions

* Implement teams perms for the initial index

* Make private functions start with underscore

* Implement slim_doc retrieval and fix up doc_sync

* Simplify how doc-sync is done

* Refactor jira doc-sync

* Make locally used function start with an underscore

* Update backend/ee/onyx/configs/app_configs.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Add docstring to helper function

* Update tests

* Add an expected failure

* Address comment on PR

* Skip expert-info if user does not have a display-name

* Add doc comments

* Fix error in generic_doc_sync

* Move callback invocation to earlier in the loop

* Update tests to include proper list of user emails

* Update logic to grab user emails as well

* Only fetch expert-info if channel is not public

* Pull expert-info creation outside of loop

* Remove unnecessary call to `iter`

* Switch from `dataclass` to `BaseModel`

* Simplify boolean logic

* Simplify logic for determining if channel is public

* Remove unnecessary channel membership-type

* Add log-warns

* Only perform another API fetch if email is not present

* Address comments on PR

* Add message on assertion failure

* Address typo

* Make exception message more descriptive

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-06-27 01:41:01 +00:00
Raunak Bhagat
e1766bca55 feat: MS Teams permission syncing (#4934)
* Add function stubs for Teams

* Implement more boilerplate code

* Change structure of helper functions

* Implement teams perms for the initial index

* Make private functions start with underscore

* Implement slim_doc retrieval and fix up doc_sync

* Simplify how doc-sync is done

* Refactor jira doc-sync

* Make locally used function start with an underscore

* Update backend/ee/onyx/configs/app_configs.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Add docstring to helper function

* Update tests

* Add an expected failure

* Address comment on PR

* Skip expert-info if user does not have a display-name

* Add doc comments

* Fix error in generic_doc_sync

* Move callback invocation to earlier in the loop

* Update tests to include proper list of user emails

* Update logic to grab user emails as well

* Only fetch expert-info if channel is not public

* Pull expert-info creation outside of loop

* Remove unnecessary call to `iter`

* Switch from `dataclass` to `BaseModel`

* Simplify boolean logic

* Simplify logic for determining if channel is public

* Remove unnecessary channel membership-type

* Add log-warns

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-06-26 22:36:09 +00:00
Rei Meguro
211102f5f0 kg cleanup + reintroducing deep extraction & classification (#4949)
* kg cleanup

* more cleanup

* fix: copy over _get_classification_content_from_call_chunks for content formatting

* added back deep extraction logic

* feat: making deep extraction and clustering work

* nit
2025-06-26 14:46:50 -07:00
Weves
c46cc4666f Fix query history 2 2025-06-25 21:35:53 -07:00
joachim-danswer
0b2536b82b expand definition of public 2025-06-25 20:01:09 -07:00
Rei Meguro
600a86f11d Add creator to linear (#4948)
* add creator to linear

* fix: mypy
2025-06-25 18:19:36 -07:00
Rei Meguro
4d97a03935 KG Attribute Overhaul + Processing Tests (#4933)
* feat: extract email

* title

* feat: new type definition

* working

* test and bugfix

* fix: set docid

* fix: mypy

* feat: show implied entities too

* fix import + migration

* fix: added random delay for vespa

* fix: mypy

* mypy again...

* fix: nit

* fix: mypy

* SOLUTION!

* fix

* cleanup

* fix: transfer

* nit

---------

Co-authored-by: joachim-danswer <joachim@danswer.ai>
2025-06-25 05:06:12 +00:00
Raunak Bhagat
5d7169f244 Implement JIRA permission syncing (#4899) 2025-06-24 23:59:26 +00:00
Wenxi
df9329009c curator bug fixes (#4941)
* curator bug fixes

* basic users default to my files

* fix admin param + move delete button

* fix trashcan admin only

---------

Co-authored-by: Wenxi Onyx <wenxi-onyx@Wenxis-MacBook-Pro.local>
2025-06-24 21:33:47 +00:00
Arun Philip
e74a0398dc Update Docker Compose restart policy to unless-stopped
Changed the restart policy to unless-stopped to ensure containers
automatically restart after failures or reboots but allow manual stop
without immediate restart.

This is preferable over always because it prevents containers from
restarting automatically after a manual stop, enabling controlled
shutdowns and maintenance without unintended restarts.
2025-06-24 13:27:50 -07:00
SubashMohan
94c5822cb7 Add MinIO configuration to env template and update restart script for MinIO container (#4944)
Co-authored-by: Subash <subash@onyx.app>
2025-06-24 17:21:16 +00:00
joachim-danswer
dedac55098 KG extraction without vespa queries (#4940)
* no vespa in extraction

* prompt/flow improvements

* EL comments

* nit

* Updated get_session_with_current_tenant import

---------

Co-authored-by: Rei Meguro <36625832+Orbital-Web@users.noreply.github.com>
2025-06-24 15:02:50 +00:00
Chris Weaver
2bbab5cefe Handle very long file names (#4939)
* Handle very long file names

* Add logging

* Enhancements

* EL comments
2025-06-23 19:22:02 -07:00
joachim-danswer
4bef718fad fix kg db proxy (#4942) 2025-06-23 18:27:59 -07:00
Chris Weaver
e7376e9dc2 Add support for db proxy (#4932)
* Split up engine file

* Switch to schema_translate_map

* Fix mass search/replace

* Remove unused

* Fix mypy

* Fix

* Add back __init__.py

* kg fix for new session management

Adding "<tenant_id>" in front of all views.

* additional kg fix

* better handling

* improve naming

---------

Co-authored-by: joachim-danswer <joachim@danswer.ai>
2025-06-23 17:19:07 -07:00
Raunak Bhagat
8d5136fe8b Fix error in which curator sidebars were hitting kg-exposed endpoint 2025-06-23 17:07:11 -07:00
joachim-danswer
3272050975 docker dev and prod template (#4936)
* docker dev and prod template

* more dev files
2025-06-23 21:43:42 +00:00
Weves
1960714042 Fix query history 2025-06-23 14:32:14 -07:00
Weves
5bddb2632e Fix parallel tool calls 2025-06-23 09:50:44 -07:00
Raunak Bhagat
5cd055dab8 Add minor type-checking fixes (#4916) 2025-06-23 13:34:40 +00:00
Raunak Bhagat
fa32b7f21e Update ruff and remove ruff-formating from pr checks (#4914) 2025-06-23 05:34:34 -07:00
Rei Meguro
37f7227000 fix: too many vespa request fix (#4931) 2025-06-22 14:31:42 -07:00
Chris Weaver
c1f9a9d122 Hubspot connector enhancements (#4927)
* Enhance hubspot connector

* Add companies, deals, and tickets

* improve typing

* Add HUBSPOT_ACCESS_TOKEN to connector tests

* Fix prettier

* Fix mypy

* Address JR comments
2025-06-22 13:54:04 -07:00
Rei Meguro
045b7cc7e2 feat: comma separated citations (#4923)
* feat: comma separated citations

* nit

* fix

* fix: comment
2025-06-21 22:51:32 +00:00
joachim-danswer
970e07a93b Forcing vespa language 2025-06-21 16:12:13 -07:00
joachim-danswer
d463a3f213 KG Updates (#4925)
* updates

 - no classification if deep extraction is False
 - separate names for views in LLM generation
 - better prompts
 - any relationship type provided to LLM that relates to identified entities

* CW feedback/comment update
2025-06-21 20:16:39 +00:00
Wenxi
4ba44c5e48 Fix no subject gmail docs (#4922)
Co-authored-by: Wenxi Onyx <wenxi-onyx@Wenxis-MacBook-Pro.local>
2025-06-20 23:22:49 +00:00
Chris Weaver
6f8176092e S3 like file store (#4897)
* Move to an S3-like file store

* Add non-mocked test

* Add S3 tests

* Improve migration / add auto-running tests

* Refactor

* Fix mypy

* Small fixes

* Improve migration to handle downgrades

* fix file store tests

* Fix file store tests again

* Fix file store tests again

* Fix mypy

* Fix default values

* Add MinIO to other compose files

* Working helm w/ minio

* Fix test

* Address greptile comments

* Harden migration

* Fix README

* Fix it

* Address more greptile comments

* Fix it

* Rebase

* Handle multi-tenant case

* Fix mypy

* Fix test

* fix test

* Improve migration

* Fix test
2025-06-20 14:22:05 -07:00
Wenxi
198ec417ba fix gemini model names + add vertex claude sonnet 4 (#4920)
* fix gemini model names + add vertex claude sonnet 4

* few more models

---------

Co-authored-by: Wenxi Onyx <wenxi-onyx@Wenxis-MacBook-Pro.local>
2025-06-20 18:18:36 +00:00
Wenxi
fbdf7798cf GCS metadata processing (#4879)
* GCS metadata processing

* Unprocessable files should still be indexed to be searched by title

* Moved re-used logic to utils. Combined file metadata PR with GCS metadata changes

* Added OnyxMetadata type, adjusted timestamp naming consistency, clarified timestamp logic

* Use BaseModel

---------

Co-authored-by: Wenxi Onyx <wenxi-onyx@Wenxis-MacBook-Pro.local>
2025-06-20 16:11:38 +00:00
Weves
7bd9c856aa Really add psql to api-server 2025-06-19 18:50:17 -07:00
Rei Meguro
948c719d73 fix (#4915) 2025-06-19 23:06:34 +00:00
335 changed files with 11016 additions and 4428 deletions

View File

@@ -0,0 +1,86 @@
name: External Dependency Unit Tests
on:
merge_group:
pull_request:
branches: [main]
env:
# AWS
S3_AWS_ACCESS_KEY_ID: ${{ secrets.S3_AWS_ACCESS_KEY_ID }}
S3_AWS_SECRET_ACCESS_KEY: ${{ secrets.S3_AWS_SECRET_ACCESS_KEY }}
# MinIO
S3_ENDPOINT_URL: "http://localhost:9004"
jobs:
discover-test-dirs:
runs-on: ubuntu-latest
outputs:
test-dirs: ${{ steps.set-matrix.outputs.test-dirs }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Discover test directories
id: set-matrix
run: |
# Find all subdirectories in backend/tests/external_dependency_unit
dirs=$(find backend/tests/external_dependency_unit -mindepth 1 -maxdepth 1 -type d -exec basename {} \; | sort | jq -R -s -c 'split("\n")[:-1]')
echo "test-dirs=$dirs" >> $GITHUB_OUTPUT
external-dependency-unit-tests:
needs: discover-test-dirs
# See https://runs-on.com/runners/linux/
runs-on: [runs-on, runner=8cpu-linux-x64, "run-id=${{ github.run_id }}"]
strategy:
fail-fast: false
matrix:
test-dir: ${{ fromJson(needs.discover-test-dirs.outputs.test-dirs) }}
env:
PYTHONPATH: ./backend
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
cache: "pip"
cache-dependency-path: |
backend/requirements/default.txt
backend/requirements/dev.txt
- name: Install Dependencies
run: |
python -m pip install --upgrade pip
pip install --retries 5 --timeout 30 -r backend/requirements/default.txt
pip install --retries 5 --timeout 30 -r backend/requirements/dev.txt
playwright install chromium
playwright install-deps chromium
- name: Set up Standard Dependencies
run: |
cd deployment/docker_compose
docker compose -f docker-compose.dev.yml -p onyx-stack up -d minio relational_db cache index
- name: Run migrations
run: |
cd backend
alembic upgrade head
- name: Run Tests for ${{ matrix.test-dir }}
shell: script -q -e -c "bash --noprofile --norc -eo pipefail {0}"
run: |
py.test \
-n 8 \
--dist loadfile \
--durations=8 \
-o junit_family=xunit2 \
-xv \
--ff \
backend/tests/external_dependency_unit/${{ matrix.test-dir }}
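
For reference, a minimal Python sketch of what the discover-test-dirs step emits (the find | sort | jq pipeline above is the authoritative version; the example output value is hypothetical and depends on which subdirectories exist):

import json
from pathlib import Path

def discover_test_dirs(root: str = "backend/tests/external_dependency_unit") -> str:
    # Collect immediate subdirectory names, sorted, as a compact JSON array,
    # mirroring: find ... -mindepth 1 -maxdepth 1 -type d | sort | jq -R -s -c
    dirs = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
    return json.dumps(dirs, separators=(",", ":"))

if __name__ == "__main__":
    # e.g. test-dirs=["connectors","file_store"] -- the value written to $GITHUB_OUTPUT
    print(f"test-dirs={discover_test_dirs()}")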

View File

@@ -16,6 +16,9 @@ env:
CONFLUENCE_TEST_SPACE_URL: ${{ secrets.CONFLUENCE_TEST_SPACE_URL }}
CONFLUENCE_USER_NAME: ${{ secrets.CONFLUENCE_USER_NAME }}
CONFLUENCE_ACCESS_TOKEN: ${{ secrets.CONFLUENCE_ACCESS_TOKEN }}
JIRA_BASE_URL: ${{ secrets.JIRA_BASE_URL }}
JIRA_USER_EMAIL: ${{ secrets.JIRA_USER_EMAIL }}
JIRA_API_TOKEN: ${{ secrets.JIRA_API_TOKEN }}
PLATFORM_PAIR: linux-amd64
jobs:
@@ -266,6 +269,9 @@ jobs:
-e CONFLUENCE_TEST_SPACE_URL=${CONFLUENCE_TEST_SPACE_URL} \
-e CONFLUENCE_USER_NAME=${CONFLUENCE_USER_NAME} \
-e CONFLUENCE_ACCESS_TOKEN=${CONFLUENCE_ACCESS_TOKEN} \
-e JIRA_BASE_URL=${JIRA_BASE_URL} \
-e JIRA_USER_EMAIL=${JIRA_USER_EMAIL} \
-e JIRA_API_TOKEN=${JIRA_API_TOKEN} \
-e TEST_WEB_HOSTNAME=test-runner \
-e MOCK_CONNECTOR_SERVER_HOST=mock_connector_server \
-e MOCK_CONNECTOR_SERVER_PORT=8001 \

View File

@@ -16,6 +16,9 @@ env:
CONFLUENCE_TEST_SPACE_URL: ${{ secrets.CONFLUENCE_TEST_SPACE_URL }}
CONFLUENCE_USER_NAME: ${{ secrets.CONFLUENCE_USER_NAME }}
CONFLUENCE_ACCESS_TOKEN: ${{ secrets.CONFLUENCE_ACCESS_TOKEN }}
JIRA_BASE_URL: ${{ secrets.JIRA_BASE_URL }}
JIRA_USER_EMAIL: ${{ secrets.JIRA_USER_EMAIL }}
JIRA_API_TOKEN: ${{ secrets.JIRA_API_TOKEN }}
PLATFORM_PAIR: linux-amd64
jobs:
integration-tests-mit:
@@ -201,6 +204,9 @@ jobs:
-e CONFLUENCE_TEST_SPACE_URL=${CONFLUENCE_TEST_SPACE_URL} \
-e CONFLUENCE_USER_NAME=${CONFLUENCE_USER_NAME} \
-e CONFLUENCE_ACCESS_TOKEN=${CONFLUENCE_ACCESS_TOKEN} \
-e JIRA_BASE_URL=${JIRA_BASE_URL} \
-e JIRA_USER_EMAIL=${JIRA_USER_EMAIL} \
-e JIRA_API_TOKEN=${JIRA_API_TOKEN} \
-e TEST_WEB_HOSTNAME=test-runner \
-e MOCK_CONNECTOR_SERVER_HOST=mock_connector_server \
-e MOCK_CONNECTOR_SERVER_PORT=8001 \

View File

@@ -22,6 +22,7 @@ env:
CONFLUENCE_ACCESS_TOKEN: ${{ secrets.CONFLUENCE_ACCESS_TOKEN }}
# Jira
JIRA_BASE_URL: ${{ secrets.JIRA_BASE_URL }}
JIRA_USER_EMAIL: ${{ secrets.JIRA_USER_EMAIL }}
JIRA_API_TOKEN: ${{ secrets.JIRA_API_TOKEN }}
@@ -49,6 +50,9 @@ env:
SF_PASSWORD: ${{ secrets.SF_PASSWORD }}
SF_SECURITY_TOKEN: ${{ secrets.SF_SECURITY_TOKEN }}
# Hubspot
HUBSPOT_ACCESS_TOKEN: ${{ secrets.HUBSPOT_ACCESS_TOKEN }}
# Airtable
AIRTABLE_TEST_BASE_ID: ${{ secrets.AIRTABLE_TEST_BASE_ID }}
AIRTABLE_TEST_TABLE_ID: ${{ secrets.AIRTABLE_TEST_TABLE_ID }}

View File

@@ -58,3 +58,9 @@ AGENT_RETRIEVAL_STATS=False # Note: This setting will incur substantial re-ran
AGENT_RERANKING_STATS=True
AGENT_MAX_QUERY_RETRIEVAL_RESULTS=20
AGENT_RERANKING_MAX_QUERY_RETRIEVAL_RESULTS=20
# S3 File Store Configuration (MinIO for local development)
S3_ENDPOINT_URL=http://localhost:9004
S3_FILE_STORE_BUCKET_NAME=onyx-file-store-bucket
S3_AWS_ACCESS_KEY_ID=minioadmin
S3_AWS_SECRET_ACCESS_KEY=minioadmin
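
As a quick sanity check, a hedged sketch of how these values would typically be consumed, assuming a boto3-style S3 client (the variable names come from the template above; the listing call is purely illustrative):

import os
import boto3

# Point an S3 client at the local MinIO instance configured above.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],              # http://localhost:9004
    aws_access_key_id=os.environ["S3_AWS_ACCESS_KEY_ID"],     # minioadmin
    aws_secret_access_key=os.environ["S3_AWS_SECRET_ACCESS_KEY"],
)

# List objects in the file-store bucket named in the template.
print(s3.list_objects_v2(Bucket=os.environ["S3_FILE_STORE_BUCKET_NAME"]))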

View File

@@ -37,8 +37,7 @@ RUN apt-get update && \
pkg-config \
gcc \
nano \
vim \
postgresql-client && \
vim && \
rm -rf /var/lib/apt/lists/* && \
apt-get clean
@@ -78,6 +77,9 @@ RUN apt-get update && \
rm -rf /var/lib/apt/lists/* && \
rm -f /usr/local/lib/python3.11/site-packages/tornado/test/test.key
# Install postgresql-client for easy manual tests
# Install it here to avoid it being cleaned up above
RUN apt-get update && apt-get install -y postgresql-client
# Pre-downloading models for setups with limited egress
RUN python -c "from tokenizers import Tokenizer; \

View File

@@ -20,3 +20,44 @@ To run all un-applied migrations:
To undo migrations:
`alembic downgrade -X`
where X is the number of migrations you want to undo from the current state
### Multi-tenant migrations
For multi-tenant deployments, you can use additional options:
**Upgrade all tenants:**
```bash
alembic -x upgrade_all_tenants=true upgrade head
```
**Upgrade specific schemas:**
```bash
# Single schema
alembic -x schemas=tenant_12345678-1234-1234-1234-123456789012 upgrade head
# Multiple schemas (comma-separated)
alembic -x schemas=tenant_12345678-1234-1234-1234-123456789012,public,another_tenant upgrade head
```
**Upgrade tenants within an alphabetical range:**
```bash
# Upgrade tenants 100-200 when sorted alphabetically (positions 100 to 200)
alembic -x upgrade_all_tenants=true -x tenant_range_start=100 -x tenant_range_end=200 upgrade head
# Upgrade tenants starting from position 1000 alphabetically
alembic -x upgrade_all_tenants=true -x tenant_range_start=1000 upgrade head
# Upgrade first 500 tenants alphabetically
alembic -x upgrade_all_tenants=true -x tenant_range_end=500 upgrade head
```
**Continue on error (for batch operations):**
```bash
alembic -x upgrade_all_tenants=true -x continue=true upgrade head
```
The tenant range filtering works by:
1. Sorting tenant IDs alphabetically
2. Using 1-based position numbers (1st, 2nd, 3rd tenant, etc.)
3. Filtering to the specified range of positions
4. Non-tenant schemas (like 'public') are always included
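
For illustration, a standalone sketch of the position rules described above (simplified; the real logic lives in alembic/env.py and the "tenant_" prefix check is an assumption about the tenant schema naming):

def filter_by_position(schemas: list[str], start: int | None = None, end: int | None = None) -> list[str]:
    # Tenants are sorted alphabetically, positions are 1-based and inclusive,
    # and non-tenant schemas such as "public" always pass through.
    tenants = sorted(s for s in schemas if s.startswith("tenant_"))
    others = [s for s in schemas if not s.startswith("tenant_")]
    lo = (start - 1) if start is not None else 0   # 1-based position -> 0-based index
    hi = end if end is not None else len(tenants)
    return tenants[lo:hi] + others

# filter_by_position(["tenant_a", "tenant_b", "tenant_c", "public"], start=2, end=3)
# -> ["tenant_b", "tenant_c", "public"]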

View File

@@ -1,12 +1,12 @@
from typing import Any, Literal
from onyx.db.engine import get_iam_auth_token
from onyx.db.engine.iam_auth import get_iam_auth_token
from onyx.configs.app_configs import USE_IAM_AUTH
from onyx.configs.app_configs import POSTGRES_HOST
from onyx.configs.app_configs import POSTGRES_PORT
from onyx.configs.app_configs import POSTGRES_USER
from onyx.configs.app_configs import AWS_REGION_NAME
from onyx.db.engine import build_connection_string
from onyx.db.engine import get_all_tenant_ids
from onyx.db.engine.sql_engine import build_connection_string
from onyx.db.engine.tenant_utils import get_all_tenant_ids
from sqlalchemy import event
from sqlalchemy import pool
from sqlalchemy import text
@@ -21,10 +21,14 @@ from alembic import context
from sqlalchemy.ext.asyncio import create_async_engine
from sqlalchemy.sql.schema import SchemaItem
from onyx.configs.constants import SSL_CERT_FILE
from shared_configs.configs import MULTI_TENANT, POSTGRES_DEFAULT_SCHEMA
from shared_configs.configs import (
MULTI_TENANT,
POSTGRES_DEFAULT_SCHEMA_STANDARD_VALUE,
TENANT_ID_PREFIX,
)
from onyx.db.models import Base
from celery.backends.database.session import ResultModelBase # type: ignore
from onyx.db.engine import SqlEngine
from onyx.db.engine.sql_engine import SqlEngine
# Make sure in alembic.ini [logger_root] level=INFO is set or most logging will be
# hidden! (defaults to level=WARN)
@@ -69,15 +73,67 @@ def include_object(
return True
def get_schema_options() -> tuple[str, bool, bool, bool]:
def filter_tenants_by_range(
tenant_ids: list[str], start_range: int | None = None, end_range: int | None = None
) -> list[str]:
"""
Filter tenant IDs by alphabetical position range.
Args:
tenant_ids: List of tenant IDs to filter
start_range: Starting position in alphabetically sorted list (1-based, inclusive)
end_range: Ending position in alphabetically sorted list (1-based, inclusive)
Returns:
Filtered list of tenant IDs in their original order
"""
if start_range is None and end_range is None:
return tenant_ids
# Separate tenant IDs from non-tenant schemas
tenant_schemas = [tid for tid in tenant_ids if tid.startswith(TENANT_ID_PREFIX)]
non_tenant_schemas = [
tid for tid in tenant_ids if not tid.startswith(TENANT_ID_PREFIX)
]
# Sort tenant schemas alphabetically.
# NOTE: can cause missed schemas if a schema is created in between workers
# fetching of all tenant IDs. We accept this risk for now. Just re-running
# the migration will fix the issue.
sorted_tenant_schemas = sorted(tenant_schemas)
# Apply range filtering (0-based indexing)
start_idx = start_range if start_range is not None else 0
end_idx = end_range if end_range is not None else len(sorted_tenant_schemas)
# Ensure indices are within bounds
start_idx = max(0, start_idx)
end_idx = min(len(sorted_tenant_schemas), end_idx)
# Get the filtered tenant schemas
filtered_tenant_schemas = sorted_tenant_schemas[start_idx:end_idx]
# Combine with non-tenant schemas and preserve original order
filtered_tenants = []
for tenant_id in tenant_ids:
if tenant_id in filtered_tenant_schemas or tenant_id in non_tenant_schemas:
filtered_tenants.append(tenant_id)
return filtered_tenants
def get_schema_options() -> (
tuple[bool, bool, bool, int | None, int | None, list[str] | None]
):
x_args_raw = context.get_x_argument()
x_args = {}
for arg in x_args_raw:
for pair in arg.split(","):
if "=" in pair:
key, value = pair.split("=", 1)
x_args[key.strip()] = value.strip()
schema_name = x_args.get("schema", POSTGRES_DEFAULT_SCHEMA)
if "=" in arg:
key, value = arg.split("=", 1)
x_args[key.strip()] = value.strip()
else:
raise ValueError(f"Invalid argument: {arg}")
create_schema = x_args.get("create_schema", "true").lower() == "true"
upgrade_all_tenants = x_args.get("upgrade_all_tenants", "false").lower() == "true"
@@ -85,17 +141,81 @@ def get_schema_options() -> tuple[str, bool, bool, bool]:
# only applies to online migrations
continue_on_error = x_args.get("continue", "false").lower() == "true"
if (
MULTI_TENANT
and schema_name == POSTGRES_DEFAULT_SCHEMA
and not upgrade_all_tenants
):
# Tenant range filtering
tenant_range_start = None
tenant_range_end = None
if "tenant_range_start" in x_args:
try:
tenant_range_start = int(x_args["tenant_range_start"])
except ValueError:
raise ValueError(
f"Invalid tenant_range_start value: {x_args['tenant_range_start']}. Must be an integer."
)
if "tenant_range_end" in x_args:
try:
tenant_range_end = int(x_args["tenant_range_end"])
except ValueError:
raise ValueError(
f"Invalid tenant_range_end value: {x_args['tenant_range_end']}. Must be an integer."
)
# Validate range
if tenant_range_start is not None and tenant_range_end is not None:
if tenant_range_start > tenant_range_end:
raise ValueError(
f"tenant_range_start ({tenant_range_start}) cannot be greater than tenant_range_end ({tenant_range_end})"
)
# Specific schema names filtering (replaces both schema_name and the old tenant_ids approach)
schemas = None
if "schemas" in x_args:
schema_names_str = x_args["schemas"].strip()
if schema_names_str:
# Split by comma and strip whitespace
schemas = [
name.strip() for name in schema_names_str.split(",") if name.strip()
]
if schemas:
logger.info(f"Specific schema names specified: {schemas}")
# Validate that only one method is used at a time
range_filtering = tenant_range_start is not None or tenant_range_end is not None
specific_filtering = schemas is not None and len(schemas) > 0
if range_filtering and specific_filtering:
raise ValueError(
"Cannot run default migrations in public schema when multi-tenancy is enabled. "
"Please specify a tenant-specific schema."
"Cannot use both tenant range filtering (tenant_range_start/tenant_range_end) "
"and specific schema filtering (schemas) at the same time. "
"Please use only one filtering method."
)
return schema_name, create_schema, upgrade_all_tenants, continue_on_error
if upgrade_all_tenants and specific_filtering:
raise ValueError(
"Cannot use both upgrade_all_tenants=true and schemas at the same time. "
"Use either upgrade_all_tenants=true for all tenants, or schemas for specific schemas."
)
# If any filtering parameters are specified, we're not doing the default single schema migration
if range_filtering:
upgrade_all_tenants = True
# Validate multi-tenant requirements
if MULTI_TENANT and not upgrade_all_tenants and not specific_filtering:
raise ValueError(
"In multi-tenant mode, you must specify either upgrade_all_tenants=true "
"or provide schemas. Cannot run default migration."
)
return (
create_schema,
upgrade_all_tenants,
continue_on_error,
tenant_range_start,
tenant_range_end,
schemas,
)
def do_run_migrations(
@@ -142,12 +262,17 @@ def provide_iam_token_for_alembic(
async def run_async_migrations() -> None:
(
schema_name,
create_schema,
upgrade_all_tenants,
continue_on_error,
tenant_range_start,
tenant_range_end,
schemas,
) = get_schema_options()
if not schemas and not MULTI_TENANT:
schemas = [POSTGRES_DEFAULT_SCHEMA_STANDARD_VALUE]
# without init_engine, subsequent engine calls fail hard intentionally
SqlEngine.init_engine(pool_size=20, max_overflow=5)
@@ -164,12 +289,50 @@ async def run_async_migrations() -> None:
) -> None:
provide_iam_token_for_alembic(dialect, conn_rec, cargs, cparams)
if upgrade_all_tenants:
if schemas:
# Use specific schema names directly without fetching all tenants
logger.info(f"Migrating specific schema names: {schemas}")
i_schema = 0
num_schemas = len(schemas)
for schema in schemas:
i_schema += 1
logger.info(
f"Migrating schema: index={i_schema} num_schemas={num_schemas} schema={schema}"
)
try:
async with engine.connect() as connection:
await connection.run_sync(
do_run_migrations,
schema_name=schema,
create_schema=create_schema,
)
except Exception as e:
logger.error(f"Error migrating schema {schema}: {e}")
if not continue_on_error:
logger.error("--continue=true is not set, raising exception!")
raise
logger.warning("--continue=true is set, continuing to next schema.")
elif upgrade_all_tenants:
tenant_schemas = get_all_tenant_ids()
filtered_tenant_schemas = filter_tenants_by_range(
tenant_schemas, tenant_range_start, tenant_range_end
)
if tenant_range_start is not None or tenant_range_end is not None:
logger.info(
f"Filtering tenants by range: start={tenant_range_start}, end={tenant_range_end}"
)
logger.info(
f"Total tenants: {len(tenant_schemas)}, Filtered tenants: {len(filtered_tenant_schemas)}"
)
i_tenant = 0
num_tenants = len(tenant_schemas)
for schema in tenant_schemas:
num_tenants = len(filtered_tenant_schemas)
for schema in filtered_tenant_schemas:
i_tenant += 1
logger.info(
f"Migrating schema: index={i_tenant} num_tenants={num_tenants} schema={schema}"
@@ -190,17 +353,13 @@ async def run_async_migrations() -> None:
logger.warning("--continue=true is set, continuing to next schema.")
else:
try:
logger.info(f"Migrating schema: {schema_name}")
async with engine.connect() as connection:
await connection.run_sync(
do_run_migrations,
schema_name=schema_name,
create_schema=create_schema,
)
except Exception as e:
logger.error(f"Error migrating schema {schema_name}: {e}")
raise
# This should not happen in the new design since we require either
# upgrade_all_tenants=true or schemas in multi-tenant mode
# and for non-multi-tenant mode, we should use schemas with the default schema
raise ValueError(
"No migration target specified. Use either upgrade_all_tenants=true for all tenants "
"or schemas for specific schemas."
)
await engine.dispose()
@@ -221,10 +380,37 @@ def run_migrations_offline() -> None:
# without init_engine, subsequent engine calls fail hard intentionally
SqlEngine.init_engine(pool_size=20, max_overflow=5)
schema_name, _, upgrade_all_tenants, continue_on_error = get_schema_options()
(
create_schema,
upgrade_all_tenants,
continue_on_error,
tenant_range_start,
tenant_range_end,
schemas,
) = get_schema_options()
url = build_connection_string()
if upgrade_all_tenants:
if schemas:
# Use specific schema names directly without fetching all tenants
logger.info(f"Migrating specific schema names: {schemas}")
for schema in schemas:
logger.info(f"Migrating schema: {schema}")
context.configure(
url=url,
target_metadata=target_metadata, # type: ignore
literal_binds=True,
include_object=include_object,
version_table_schema=schema,
include_schemas=True,
script_location=config.get_main_option("script_location"),
dialect_opts={"paramstyle": "named"},
)
with context.begin_transaction():
context.run_migrations()
elif upgrade_all_tenants:
engine = create_async_engine(url)
if USE_IAM_AUTH:
@@ -238,7 +424,19 @@ def run_migrations_offline() -> None:
tenant_schemas = get_all_tenant_ids()
engine.sync_engine.dispose()
for schema in tenant_schemas:
filtered_tenant_schemas = filter_tenants_by_range(
tenant_schemas, tenant_range_start, tenant_range_end
)
if tenant_range_start is not None or tenant_range_end is not None:
logger.info(
f"Filtering tenants by range: start={tenant_range_start}, end={tenant_range_end}"
)
logger.info(
f"Total tenants: {len(tenant_schemas)}, Filtered tenants: {len(filtered_tenant_schemas)}"
)
for schema in filtered_tenant_schemas:
logger.info(f"Migrating schema: {schema}")
context.configure(
url=url,
@@ -254,21 +452,12 @@ def run_migrations_offline() -> None:
with context.begin_transaction():
context.run_migrations()
else:
logger.info(f"Migrating schema: {schema_name}")
context.configure(
url=url,
target_metadata=target_metadata, # type: ignore
literal_binds=True,
include_object=include_object,
version_table_schema=schema_name,
include_schemas=True,
script_location=config.get_main_option("script_location"),
dialect_opts={"paramstyle": "named"},
# This should not happen in the new design
raise ValueError(
"No migration target specified. Use either upgrade_all_tenants=true for all tenants "
"or schemas for specific schemas."
)
with context.begin_transaction():
context.run_migrations()
def run_migrations_online() -> None:
logger.info("run_migrations_online starting.")

View File

@@ -0,0 +1,136 @@
"""update_kg_trigger_functions
Revision ID: 36e9220ab794
Revises: c9e2cd766c29
Create Date: 2025-06-22 17:33:25.833733
"""
from alembic import op
from sqlalchemy.orm import Session
from sqlalchemy import text
from shared_configs.configs import POSTGRES_DEFAULT_SCHEMA_STANDARD_VALUE
# revision identifiers, used by Alembic.
revision = "36e9220ab794"
down_revision = "c9e2cd766c29"
branch_labels = None
depends_on = None
def _get_tenant_contextvar(session: Session) -> str:
"""Get the current schema for the migration"""
current_tenant = session.execute(text("SELECT current_schema()")).scalar()
if isinstance(current_tenant, str):
return current_tenant
else:
raise ValueError("Current tenant is not a string")
def upgrade() -> None:
bind = op.get_bind()
session = Session(bind=bind)
# Create kg_entity trigger to update kg_entity.name and its trigrams
tenant_id = _get_tenant_contextvar(session)
alphanum_pattern = r"[^a-z0-9]+"
truncate_length = 1000
function = "update_kg_entity_name"
op.execute(
text(
f"""
CREATE OR REPLACE FUNCTION "{tenant_id}".{function}()
RETURNS TRIGGER AS $$
DECLARE
name text;
cleaned_name text;
BEGIN
-- Set name to semantic_id if document_id is not NULL
IF NEW.document_id IS NOT NULL THEN
SELECT lower(semantic_id) INTO name
FROM "{tenant_id}".document
WHERE id = NEW.document_id;
ELSE
name = lower(NEW.name);
END IF;
-- Clean name and truncate if too long
cleaned_name = regexp_replace(
name,
'{alphanum_pattern}', '', 'g'
);
IF length(cleaned_name) > {truncate_length} THEN
cleaned_name = left(cleaned_name, {truncate_length});
END IF;
-- Set name and name trigrams
NEW.name = name;
NEW.name_trigrams = {POSTGRES_DEFAULT_SCHEMA_STANDARD_VALUE}.show_trgm(cleaned_name);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
"""
)
)
trigger = f"{function}_trigger"
op.execute(f'DROP TRIGGER IF EXISTS {trigger} ON "{tenant_id}".kg_entity')
op.execute(
f"""
CREATE TRIGGER {trigger}
BEFORE INSERT OR UPDATE OF name
ON "{tenant_id}".kg_entity
FOR EACH ROW
EXECUTE FUNCTION "{tenant_id}".{function}();
"""
)
# Create kg_entity trigger to update kg_entity.name and its trigrams
function = "update_kg_entity_name_from_doc"
op.execute(
text(
f"""
CREATE OR REPLACE FUNCTION "{tenant_id}".{function}()
RETURNS TRIGGER AS $$
DECLARE
doc_name text;
cleaned_name text;
BEGIN
doc_name = lower(NEW.semantic_id);
-- Clean name and truncate if too long
cleaned_name = regexp_replace(
doc_name,
'{alphanum_pattern}', '', 'g'
);
IF length(cleaned_name) > {truncate_length} THEN
cleaned_name = left(cleaned_name, {truncate_length});
END IF;
-- Set name and name trigrams for all entities referencing this document
UPDATE "{tenant_id}".kg_entity
SET
name = doc_name,
name_trigrams = {POSTGRES_DEFAULT_SCHEMA_STANDARD_VALUE}.show_trgm(cleaned_name)
WHERE document_id = NEW.id;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
"""
)
)
trigger = f"{function}_trigger"
op.execute(f'DROP TRIGGER IF EXISTS {trigger} ON "{tenant_id}".document')
op.execute(
f"""
CREATE TRIGGER {trigger}
AFTER UPDATE OF semantic_id
ON "{tenant_id}".document
FOR EACH ROW
EXECUTE FUNCTION "{tenant_id}".{function}();
"""
)
def downgrade() -> None:
pass
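
To make the trigger's normalization concrete, a small Python mirror of the cleaning step (illustrative only; the SQL regexp_replace/left() calls above are what actually run, and the constants are taken from this migration):

import re

ALPHANUM_PATTERN = r"[^a-z0-9]+"   # same pattern the trigger strips out
TRUNCATE_LENGTH = 1000             # same truncation limit

def clean_entity_name(name: str) -> str:
    # Lowercase, drop non-alphanumeric runs, then truncate before trigram generation.
    cleaned = re.sub(ALPHANUM_PATTERN, "", name.lower())
    return cleaned[:TRUNCATE_LENGTH]

# clean_entity_name("Onyx: KG Entity #42!") -> "onyxkgentity42"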

View File

@@ -21,22 +21,14 @@ depends_on = None
# an outage by creating an index without using CONCURRENTLY. This migration:
#
# 1. Creates more efficient full-text search capabilities using tsvector columns and GIN indexes
# 2. Uses CONCURRENTLY for all index creation to prevent table locking
# 3. Explicitly manages transactions with COMMIT statements to allow CONCURRENTLY to work
# (see: https://www.postgresql.org/docs/9.4/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY)
# (see: https://github.com/sqlalchemy/alembic/issues/277)
# 4. Adds indexes to both chat_message and chat_session tables for comprehensive search
# 2. Adds indexes to both chat_message and chat_session tables for comprehensive search
# 3. Note: CONCURRENTLY was removed due to operational issues
def upgrade() -> None:
# First, drop any existing indexes to avoid conflicts
op.execute("COMMIT")
op.execute("DROP INDEX CONCURRENTLY IF EXISTS idx_chat_message_tsv;")
op.execute("COMMIT")
op.execute("DROP INDEX CONCURRENTLY IF EXISTS idx_chat_session_desc_tsv;")
op.execute("COMMIT")
op.execute("DROP INDEX IF EXISTS idx_chat_message_tsv;")
op.execute("DROP INDEX IF EXISTS idx_chat_session_desc_tsv;")
op.execute("DROP INDEX IF EXISTS idx_chat_message_message_lower;")
# Drop existing columns if they exist
@@ -52,12 +44,9 @@ def upgrade() -> None:
"""
)
# Commit the current transaction before creating concurrent indexes
op.execute("COMMIT")
op.execute(
"""
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_chat_message_tsv
CREATE INDEX IF NOT EXISTS idx_chat_message_tsv
ON chat_message
USING GIN (message_tsv)
"""
@@ -72,12 +61,9 @@ def upgrade() -> None:
"""
)
# Commit again before creating the second concurrent index
op.execute("COMMIT")
op.execute(
"""
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_chat_session_desc_tsv
CREATE INDEX IF NOT EXISTS idx_chat_session_desc_tsv
ON chat_session
USING GIN (description_tsv)
"""
@@ -85,12 +71,9 @@ def upgrade() -> None:
def downgrade() -> None:
# Drop the indexes first (use CONCURRENTLY for dropping too)
op.execute("COMMIT")
op.execute("DROP INDEX CONCURRENTLY IF EXISTS idx_chat_message_tsv;")
op.execute("COMMIT")
op.execute("DROP INDEX CONCURRENTLY IF EXISTS idx_chat_session_desc_tsv;")
# Drop the indexes first
op.execute("DROP INDEX IF EXISTS idx_chat_message_tsv;")
op.execute("DROP INDEX IF EXISTS idx_chat_session_desc_tsv;")
# Then drop the columns
op.execute("ALTER TABLE chat_message DROP COLUMN IF EXISTS message_tsv;")

View File

@@ -467,11 +467,11 @@ def upgrade() -> None:
# Create GIN index for clustering and normalization
op.execute(
"CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_kg_entity_clustering_trigrams "
"CREATE INDEX IF NOT EXISTS idx_kg_entity_clustering_trigrams "
f"ON kg_entity USING GIN (name {POSTGRES_DEFAULT_SCHEMA_STANDARD_VALUE}.gin_trgm_ops)"
)
op.execute(
"CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_kg_entity_normalization_trigrams "
"CREATE INDEX IF NOT EXISTS idx_kg_entity_normalization_trigrams "
"ON kg_entity USING GIN (name_trigrams)"
)
@@ -625,9 +625,8 @@ def downgrade() -> None:
op.execute(f"DROP FUNCTION IF EXISTS {function}()")
# Drop index
op.execute("COMMIT") # Commit to allow CONCURRENTLY
op.execute("DROP INDEX CONCURRENTLY IF EXISTS idx_kg_entity_clustering_trigrams")
op.execute("DROP INDEX CONCURRENTLY IF EXISTS idx_kg_entity_normalization_trigrams")
op.execute("DROP INDEX IF EXISTS idx_kg_entity_clustering_trigrams")
op.execute("DROP INDEX IF EXISTS idx_kg_entity_normalization_trigrams")
# Drop tables in reverse order of creation to handle dependencies
op.drop_table("kg_term")

View File

@@ -0,0 +1,90 @@
"""add stale column to external user group tables
Revision ID: 58c50ef19f08
Revises: 7b9b952abdf6
Create Date: 2025-06-25 14:08:14.162380
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "58c50ef19f08"
down_revision = "7b9b952abdf6"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Add the stale column with default value False to user__external_user_group_id
op.add_column(
"user__external_user_group_id",
sa.Column("stale", sa.Boolean(), nullable=False, server_default="false"),
)
# Create index for efficient querying of stale rows by cc_pair_id
op.create_index(
"ix_user__external_user_group_id_cc_pair_id_stale",
"user__external_user_group_id",
["cc_pair_id", "stale"],
unique=False,
)
# Create index for efficient querying of all stale rows
op.create_index(
"ix_user__external_user_group_id_stale",
"user__external_user_group_id",
["stale"],
unique=False,
)
# Add the stale column with default value False to public_external_user_group
op.add_column(
"public_external_user_group",
sa.Column("stale", sa.Boolean(), nullable=False, server_default="false"),
)
# Create index for efficient querying of stale rows by cc_pair_id
op.create_index(
"ix_public_external_user_group_cc_pair_id_stale",
"public_external_user_group",
["cc_pair_id", "stale"],
unique=False,
)
# Create index for efficient querying of all stale rows
op.create_index(
"ix_public_external_user_group_stale",
"public_external_user_group",
["stale"],
unique=False,
)
def downgrade() -> None:
# Drop the indices for public_external_user_group first
op.drop_index(
"ix_public_external_user_group_stale", table_name="public_external_user_group"
)
op.drop_index(
"ix_public_external_user_group_cc_pair_id_stale",
table_name="public_external_user_group",
)
# Drop the stale column from public_external_user_group
op.drop_column("public_external_user_group", "stale")
# Drop the indices for user__external_user_group_id
op.drop_index(
"ix_user__external_user_group_id_stale",
table_name="user__external_user_group_id",
)
op.drop_index(
"ix_user__external_user_group_id_cc_pair_id_stale",
table_name="user__external_user_group_id",
)
# Drop the stale column from user__external_user_group_id
op.drop_column("user__external_user_group_id", "stale")

View File

@@ -0,0 +1,318 @@
"""update-entities
Revision ID: 7b9b952abdf6
Revises: 36e9220ab794
Create Date: 2025-06-23 20:24:08.139201
"""
import json
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "7b9b952abdf6"
down_revision = "36e9220ab794"
branch_labels = None
depends_on = None
def upgrade() -> None:
conn = op.get_bind()
# new entity type metadata_attribute_conversion
new_entity_type_conversion = {
"LINEAR": {
"team": {"name": "team", "keep": True, "implication_property": None},
"state": {"name": "state", "keep": True, "implication_property": None},
"priority": {
"name": "priority",
"keep": True,
"implication_property": None,
},
"estimate": {
"name": "estimate",
"keep": True,
"implication_property": None,
},
"created_at": {
"name": "created_at",
"keep": True,
"implication_property": None,
},
"started_at": {
"name": "started_at",
"keep": True,
"implication_property": None,
},
"completed_at": {
"name": "completed_at",
"keep": True,
"implication_property": None,
},
"due_date": {
"name": "due_date",
"keep": True,
"implication_property": None,
},
"creator": {
"name": "creator",
"keep": False,
"implication_property": {
"implied_entity_type": "from_email",
"implied_relationship_name": "is_creator_of",
},
},
"assignee": {
"name": "assignee",
"keep": False,
"implication_property": {
"implied_entity_type": "from_email",
"implied_relationship_name": "is_assignee_of",
},
},
},
"JIRA": {
"issuetype": {
"name": "subtype",
"keep": True,
"implication_property": None,
},
"status": {"name": "status", "keep": True, "implication_property": None},
"priority": {
"name": "priority",
"keep": True,
"implication_property": None,
},
"project_name": {
"name": "project",
"keep": True,
"implication_property": None,
},
"created": {
"name": "created_at",
"keep": True,
"implication_property": None,
},
"updated": {
"name": "updated_at",
"keep": True,
"implication_property": None,
},
"resolution_date": {
"name": "completed_at",
"keep": True,
"implication_property": None,
},
"duedate": {"name": "due_date", "keep": True, "implication_property": None},
"reporter_email": {
"name": "creator",
"keep": False,
"implication_property": {
"implied_entity_type": "from_email",
"implied_relationship_name": "is_creator_of",
},
},
"assignee_email": {
"name": "assignee",
"keep": False,
"implication_property": {
"implied_entity_type": "from_email",
"implied_relationship_name": "is_assignee_of",
},
},
"key": {"name": "key", "keep": True, "implication_property": None},
"parent": {"name": "parent", "keep": True, "implication_property": None},
},
"GITHUB_PR": {
"repo": {"name": "repository", "keep": True, "implication_property": None},
"state": {"name": "state", "keep": True, "implication_property": None},
"num_commits": {
"name": "num_commits",
"keep": True,
"implication_property": None,
},
"num_files_changed": {
"name": "num_files_changed",
"keep": True,
"implication_property": None,
},
"labels": {"name": "labels", "keep": True, "implication_property": None},
"merged": {"name": "merged", "keep": True, "implication_property": None},
"merged_at": {
"name": "merged_at",
"keep": True,
"implication_property": None,
},
"closed_at": {
"name": "closed_at",
"keep": True,
"implication_property": None,
},
"created_at": {
"name": "created_at",
"keep": True,
"implication_property": None,
},
"updated_at": {
"name": "updated_at",
"keep": True,
"implication_property": None,
},
"user": {
"name": "creator",
"keep": False,
"implication_property": {
"implied_entity_type": "from_email",
"implied_relationship_name": "is_creator_of",
},
},
"assignees": {
"name": "assignees",
"keep": False,
"implication_property": {
"implied_entity_type": "from_email",
"implied_relationship_name": "is_assignee_of",
},
},
},
"GITHUB_ISSUE": {
"repo": {"name": "repository", "keep": True, "implication_property": None},
"state": {"name": "state", "keep": True, "implication_property": None},
"labels": {"name": "labels", "keep": True, "implication_property": None},
"closed_at": {
"name": "closed_at",
"keep": True,
"implication_property": None,
},
"created_at": {
"name": "created_at",
"keep": True,
"implication_property": None,
},
"updated_at": {
"name": "updated_at",
"keep": True,
"implication_property": None,
},
"user": {
"name": "creator",
"keep": False,
"implication_property": {
"implied_entity_type": "from_email",
"implied_relationship_name": "is_creator_of",
},
},
"assignees": {
"name": "assignees",
"keep": False,
"implication_property": {
"implied_entity_type": "from_email",
"implied_relationship_name": "is_assignee_of",
},
},
},
"FIREFLIES": {},
"ACCOUNT": {},
"OPPORTUNITY": {
"name": {"name": "name", "keep": True, "implication_property": None},
"stage_name": {"name": "stage", "keep": True, "implication_property": None},
"type": {"name": "type", "keep": True, "implication_property": None},
"amount": {"name": "amount", "keep": True, "implication_property": None},
"fiscal_year": {
"name": "fiscal_year",
"keep": True,
"implication_property": None,
},
"fiscal_quarter": {
"name": "fiscal_quarter",
"keep": True,
"implication_property": None,
},
"is_closed": {
"name": "is_closed",
"keep": True,
"implication_property": None,
},
"close_date": {
"name": "close_date",
"keep": True,
"implication_property": None,
},
"probability": {
"name": "close_probability",
"keep": True,
"implication_property": None,
},
"created_date": {
"name": "created_at",
"keep": True,
"implication_property": None,
},
"last_modified_date": {
"name": "updated_at",
"keep": True,
"implication_property": None,
},
"account": {
"name": "account",
"keep": False,
"implication_property": {
"implied_entity_type": "ACCOUNT",
"implied_relationship_name": "is_account_of",
},
},
},
"VENDOR": {},
"EMPLOYEE": {},
}
current_entity_types = conn.execute(
sa.text("SELECT id_name, attributes from kg_entity_type")
).all()
for entity_type, attributes in current_entity_types:
# delete removed entity types
if entity_type not in new_entity_type_conversion:
op.execute(
sa.text(f"DELETE FROM kg_entity_type WHERE id_name = '{entity_type}'")
)
continue
# update entity type attributes
if "metadata_attributes" in attributes:
del attributes["metadata_attributes"]
attributes["metadata_attribute_conversion"] = new_entity_type_conversion[
entity_type
]
attributes_str = json.dumps(attributes).replace("'", "''")
op.execute(
sa.text(
f"UPDATE kg_entity_type SET attributes = '{attributes_str}'"
f"WHERE id_name = '{entity_type}'"
),
)
def downgrade() -> None:
conn = op.get_bind()
current_entity_types = conn.execute(
sa.text("SELECT id_name, attributes from kg_entity_type")
).all()
for entity_type, attributes in current_entity_types:
conversion = {}
if "metadata_attribute_conversion" in attributes:
conversion = attributes.pop("metadata_attribute_conversion")
attributes["metadata_attributes"] = {
attr: prop["name"] for attr, prop in conversion.items() if prop["keep"]
}
attributes_str = json.dumps(attributes).replace("'", "''")
op.execute(
sa.text(
f"UPDATE kg_entity_type SET attributes = '{attributes_str}'"
f"WHERE id_name = '{entity_type}'"
),
)

View File

@@ -0,0 +1,312 @@
"""modify_file_store_for_external_storage
Revision ID: c9e2cd766c29
Revises: 03bf8be6b53a
Create Date: 2025-06-13 14:02:09.867679
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.orm import Session
from sqlalchemy import text
from typing import cast, Any
from botocore.exceptions import ClientError
from onyx.db._deprecated.pg_file_store import delete_lobj_by_id, read_lobj
from onyx.file_store.file_store import get_s3_file_store
from shared_configs.contextvars import CURRENT_TENANT_ID_CONTEXTVAR
# revision identifiers, used by Alembic.
revision = "c9e2cd766c29"
down_revision = "03bf8be6b53a"
branch_labels = None
depends_on = None
def upgrade() -> None:
try:
# Modify existing file_store table to support external storage
op.rename_table("file_store", "file_record")
# Make lobj_oid nullable (for external storage files)
op.alter_column("file_record", "lobj_oid", nullable=True)
# Add external storage columns with generic names
op.add_column(
"file_record", sa.Column("bucket_name", sa.String(), nullable=True)
)
op.add_column(
"file_record", sa.Column("object_key", sa.String(), nullable=True)
)
# Add timestamps for tracking
op.add_column(
"file_record",
sa.Column(
"created_at",
sa.DateTime(timezone=True),
server_default=sa.func.now(),
nullable=False,
),
)
op.add_column(
"file_record",
sa.Column(
"updated_at",
sa.DateTime(timezone=True),
server_default=sa.func.now(),
nullable=False,
),
)
op.alter_column("file_record", "file_name", new_column_name="file_id")
except Exception as e:
if "does not exist" in str(e) or 'relation "file_store" does not exist' in str(
e
):
print(
f"Ran into error - {e}. Likely means we had a partial success in the past, continuing..."
)
else:
raise
print(
"External storage configured - migrating files from PostgreSQL to external storage..."
)
# if we fail midway through this, we'll have a partial success. Running the migration
# again should allow us to continue.
_migrate_files_to_external_storage()
print("File migration completed successfully!")
# Remove lobj_oid column
op.drop_column("file_record", "lobj_oid")
def downgrade() -> None:
"""Revert schema changes and migrate files from external storage back to PostgreSQL large objects."""
print(
"Reverting to PostgreSQL-backed file store migrating files from external storage …"
)
# 1. Ensure `lobj_oid` exists on the current `file_record` table (nullable for now).
op.add_column("file_record", sa.Column("lobj_oid", sa.Integer(), nullable=True))
# 2. Move content from external storage back into PostgreSQL large objects (table is still
# called `file_record` so application code continues to work during the copy).
try:
_migrate_files_to_postgres()
except Exception:
print("Error during downgrade migration, rolling back …")
op.drop_column("file_record", "lobj_oid")
raise
# 3. After migration every row should now have `lobj_oid` populated - mark NOT NULL.
op.alter_column("file_record", "lobj_oid", nullable=False)
# 4. Remove columns that are only relevant to external storage.
op.drop_column("file_record", "updated_at")
op.drop_column("file_record", "created_at")
op.drop_column("file_record", "object_key")
op.drop_column("file_record", "bucket_name")
# 5. Rename `file_id` back to `file_name` (still on `file_record`).
op.alter_column("file_record", "file_id", new_column_name="file_name")
# 6. Finally, rename the table back to its original name expected by the legacy codebase.
op.rename_table("file_record", "file_store")
print(
"Downgrade migration completed files are now stored inside PostgreSQL again."
)
# -----------------------------------------------------------------------------
# Helper: migrate from external storage (S3/MinIO) back into PostgreSQL large objects
def _migrate_files_to_postgres() -> None:
"""Move any files whose content lives in external S3-compatible storage back into PostgreSQL.
The logic mirrors *inverse* of `_migrate_files_to_external_storage` used on upgrade.
"""
# Obtain DB session from Alembic context
bind = op.get_bind()
session = Session(bind=bind)
# Fetch rows that have external storage pointers (bucket/object_key not NULL)
result = session.execute(
text(
"SELECT file_id, bucket_name, object_key FROM file_record "
"WHERE bucket_name IS NOT NULL AND object_key IS NOT NULL"
)
)
files_to_migrate = [row[0] for row in result.fetchall()]
total_files = len(files_to_migrate)
if total_files == 0:
print("No files found in external storage to migrate back to PostgreSQL.")
return
print(f"Found {total_files} files to migrate back to PostgreSQL large objects.")
_set_tenant_contextvar(session)
migrated_count = 0
# only create external store if we have files to migrate. This line
# makes it so we need to have S3/MinIO configured to run this migration.
external_store = get_s3_file_store(db_session=session)
for i, file_id in enumerate(files_to_migrate, 1):
print(f"Migrating file {i}/{total_files}: {file_id}")
# Read file content from external storage (always binary)
try:
file_io = external_store.read_file(
file_id=file_id, mode="b", use_tempfile=True
)
file_io.seek(0)
# Import lazily to avoid circular deps at Alembic runtime
from onyx.db._deprecated.pg_file_store import (
create_populate_lobj,
) # noqa: E402
# Create new Postgres large object and populate it
lobj_oid = create_populate_lobj(content=file_io, db_session=session)
# Update DB row: set lobj_oid, clear bucket/object_key
session.execute(
text(
"UPDATE file_record SET lobj_oid = :lobj_oid, bucket_name = NULL, "
"object_key = NULL WHERE file_id = :file_id"
),
{"lobj_oid": lobj_oid, "file_id": file_id},
)
except ClientError as e:
if "NoSuchKey" in str(e):
print(
f"File {file_id} not found in external storage. Deleting from database."
)
session.execute(
text("DELETE FROM file_record WHERE file_id = :file_id"),
{"file_id": file_id},
)
else:
raise
migrated_count += 1
print(f"✓ Successfully migrated file {i}/{total_files}: {file_id}")
# Flush the SQLAlchemy session so statements are sent to the DB, but **do not**
# commit the transaction. The surrounding Alembic migration will commit once
# the *entire* downgrade succeeds. This keeps the whole downgrade atomic and
# avoids leaving the database in a partially-migrated state if a later schema
# operation fails.
session.flush()
print(
f"Migration back to PostgreSQL completed: {migrated_count} files staged for commit."
)
def _migrate_files_to_external_storage() -> None:
"""Migrate files from PostgreSQL large objects to external storage"""
# Get database session
bind = op.get_bind()
session = Session(bind=bind)
external_store = get_s3_file_store(db_session=session)
# Find all files currently stored in PostgreSQL (lobj_oid is not null)
result = session.execute(
text(
"SELECT file_id FROM file_record WHERE lobj_oid IS NOT NULL "
"AND bucket_name IS NULL AND object_key IS NULL"
)
)
files_to_migrate = [row[0] for row in result.fetchall()]
total_files = len(files_to_migrate)
if total_files == 0:
print("No files found in PostgreSQL storage to migrate.")
return
print(f"Found {total_files} files to migrate from PostgreSQL to external storage.")
_set_tenant_contextvar(session)
migrated_count = 0
for i, file_id in enumerate(files_to_migrate, 1):
print(f"Migrating file {i}/{total_files}: {file_id}")
# Read file record to get metadata
file_record = session.execute(
text("SELECT * FROM file_record WHERE file_id = :file_id"),
{"file_id": file_id},
).fetchone()
if file_record is None:
print(f"File {file_id} not found in PostgreSQL storage.")
continue
lobj_id = cast(int, file_record.lobj_oid) # type: ignore
file_metadata = cast(Any, file_record.file_metadata) # type: ignore
# Read file content from PostgreSQL
try:
file_content = read_lobj(
lobj_id, db_session=session, mode="b", use_tempfile=True
)
except Exception as e:
if "large object" in str(e) and "does not exist" in str(e):
print(f"File {file_id} not found in PostgreSQL storage.")
continue
else:
raise
# Handle file_metadata type conversion
file_metadata = None
if file_metadata is not None:
if isinstance(file_metadata, dict):
file_metadata = file_metadata
else:
# Convert other types to dict if possible, otherwise None
try:
file_metadata = dict(file_record.file_metadata) # type: ignore
except (TypeError, ValueError):
file_metadata = None
# Save to external storage (this will handle the database record update and cleanup)
# NOTE: this WILL .commit() the transaction.
external_store.save_file(
file_id=file_id,
content=file_content,
display_name=file_record.display_name,
file_origin=file_record.file_origin,
file_type=file_record.file_type,
file_metadata=file_metadata,
)
delete_lobj_by_id(lobj_id, db_session=session)
migrated_count += 1
print(f"✓ Successfully migrated file {i}/{total_files}: {file_id}")
# See note above flush but do **not** commit so the outer Alembic transaction
# controls atomicity.
session.flush()
print(
f"Migration completed: {migrated_count} files staged for commit to external storage."
)
def _set_tenant_contextvar(session: Session) -> None:
"""Set the tenant contextvar to the default schema"""
current_tenant = session.execute(text("SELECT current_schema()")).scalar()
print(f"Migrating files for tenant: {current_tenant}")
CURRENT_TENANT_ID_CONTEXTVAR.set(current_tenant)

View File

@@ -11,7 +11,7 @@ import sqlalchemy as sa
import json
from onyx.configs.constants import DocumentSource
from onyx.connectors.onyx_jira.utils import extract_jira_project
from onyx.connectors.jira.utils import extract_jira_project
# revision identifiers, used by Alembic.

View File

@@ -8,7 +8,7 @@ from sqlalchemy.ext.asyncio import create_async_engine
from sqlalchemy.schema import SchemaItem
from alembic import context
from onyx.db.engine import build_connection_string
from onyx.db.engine.sql_engine import build_connection_string
from onyx.db.models import PublicBase
# this is the Alembic Config object, which provides

View File

@@ -16,7 +16,7 @@ from onyx.configs.constants import FileOrigin
from onyx.configs.constants import FileType
from onyx.configs.constants import OnyxCeleryTask
from onyx.configs.constants import QueryHistoryType
from onyx.db.engine import get_session_with_current_tenant
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.tasks import delete_task_with_id
from onyx.db.tasks import mark_task_as_finished_with_id
from onyx.db.tasks import mark_task_as_started_with_id
@@ -35,7 +35,13 @@ logger = setup_logger()
trail=False,
)
def export_query_history_task(
self: Task, *, start: datetime, end: datetime, start_time: datetime
self: Task,
*,
start: datetime,
end: datetime,
start_time: datetime,
# Need to include the tenant_id since the TenantAwareTask needs this
tenant_id: str,
) -> None:
if not self.request.id:
raise RuntimeError("No task id defined for this task; cannot identify it")
@@ -86,7 +92,6 @@ def export_query_history_task(
try:
stream.seek(0)
get_default_file_store(db_session).save_file(
file_name=report_name,
content=stream,
display_name=report_name,
file_origin=FileOrigin.QUERY_HISTORY_CSV,
@@ -96,6 +101,7 @@ def export_query_history_task(
"end": end.isoformat(),
"start_time": start_time.isoformat(),
},
file_id=report_name,
)
delete_task_with_id(

View File

@@ -13,7 +13,7 @@ from onyx.configs.app_configs import JOB_TIMEOUT
from onyx.configs.constants import OnyxCeleryTask
from onyx.db.chat import delete_chat_session
from onyx.db.chat import get_chat_sessions_older_than
from onyx.db.engine import get_session_with_current_tenant
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.enums import TaskStatus
from onyx.db.tasks import mark_task_as_finished_with_id
from onyx.db.tasks import register_task

View File

@@ -20,39 +20,36 @@ from shared_configs.configs import MULTI_TENANT
ee_beat_system_tasks: list[dict] = []
ee_beat_task_templates: list[dict] = []
ee_beat_task_templates.extend(
[
{
"name": "autogenerate-usage-report",
"task": OnyxCeleryTask.AUTOGENERATE_USAGE_REPORT_TASK,
"schedule": timedelta(days=30),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
ee_beat_task_templates: list[dict] = [
{
"name": "autogenerate-usage-report",
"task": OnyxCeleryTask.AUTOGENERATE_USAGE_REPORT_TASK,
"schedule": timedelta(days=30),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
{
"name": "check-ttl-management",
"task": OnyxCeleryTask.CHECK_TTL_MANAGEMENT_TASK,
"schedule": timedelta(hours=CHECK_TTL_MANAGEMENT_TASK_FREQUENCY_IN_HOURS),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-ttl-management",
"task": OnyxCeleryTask.CHECK_TTL_MANAGEMENT_TASK,
"schedule": timedelta(hours=CHECK_TTL_MANAGEMENT_TASK_FREQUENCY_IN_HOURS),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
{
"name": "export-query-history-cleanup-task",
"task": OnyxCeleryTask.EXPORT_QUERY_HISTORY_CLEANUP_TASK,
"schedule": timedelta(hours=1),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
"queue": OnyxCeleryQueues.CSV_GENERATION,
},
},
{
"name": "export-query-history-cleanup-task",
"task": OnyxCeleryTask.EXPORT_QUERY_HISTORY_CLEANUP_TASK,
"schedule": timedelta(hours=1),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
"queue": OnyxCeleryQueues.CSV_GENERATION,
},
]
)
},
]
ee_tasks_to_schedule: list[dict] = []
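
With the templates now defined as a plain module-level list, adding another EE beat task is just another dict with the same keys. A sketch (the task name and enum member below are placeholders, not real tasks in the codebase):

# Sketch: a hypothetical additional template following the same shape.
ee_beat_task_templates.append(
    {
        "name": "example-cleanup",  # placeholder name
        "task": OnyxCeleryTask.EXAMPLE_CLEANUP_TASK,  # placeholder enum member
        "schedule": timedelta(hours=6),
        "options": {
            "priority": OnyxCeleryPriority.MEDIUM,
            "expires": BEAT_EXPIRES_DEFAULT,
        },
    }
)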

View File

@@ -6,7 +6,7 @@ from celery import shared_task
from ee.onyx.db.query_history import get_all_query_history_export_tasks
from onyx.configs.app_configs import JOB_TIMEOUT
from onyx.configs.constants import OnyxCeleryTask
from onyx.db.engine import get_session_with_tenant
from onyx.db.engine.sql_engine import get_session_with_tenant
from onyx.db.enums import TaskStatus
from onyx.db.tasks import delete_task_with_id
from onyx.utils.logger import setup_logger

View File

@@ -13,7 +13,7 @@ from onyx.configs.constants import ONYX_CLOUD_TENANT_ID
from onyx.configs.constants import OnyxCeleryPriority
from onyx.configs.constants import OnyxCeleryTask
from onyx.configs.constants import OnyxRedisLocks
from onyx.db.engine import get_all_tenant_ids
from onyx.db.engine.tenant_utils import get_all_tenant_ids
from onyx.redis.redis_pool import get_redis_client
from onyx.redis.redis_pool import redis_lock_dump
from shared_configs.configs import IGNORED_SYNCING_TENANT_LIST

View File

@@ -30,6 +30,7 @@ from onyx.background.celery.celery_redis import celery_find_task
from onyx.background.celery.celery_redis import celery_get_queue_length
from onyx.background.celery.celery_redis import celery_get_queued_task_ids
from onyx.background.celery.celery_redis import celery_get_unacked_task_ids
from onyx.background.celery.tasks.beat_schedule import CLOUD_BEAT_MULTIPLIER_DEFAULT
from onyx.configs.app_configs import JOB_TIMEOUT
from onyx.configs.constants import CELERY_GENERIC_BEAT_LOCK_TIMEOUT
from onyx.configs.constants import CELERY_PERMISSIONS_SYNC_LOCK_TIMEOUT
@@ -47,8 +48,8 @@ from onyx.db.connector import mark_cc_pair_as_permissions_synced
from onyx.db.connector_credential_pair import get_connector_credential_pair_from_id
from onyx.db.document import get_document_ids_for_connector_credential_pair
from onyx.db.document import upsert_document_by_connector_credential_pair
from onyx.db.engine import get_session_with_current_tenant
from onyx.db.engine import get_session_with_tenant
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.engine.sql_engine import get_session_with_tenant
from onyx.db.enums import AccessType
from onyx.db.enums import ConnectorCredentialPairStatus
from onyx.db.enums import SyncStatus
@@ -73,6 +74,7 @@ from onyx.utils.logger import LoggerContextVars
from onyx.utils.logger import setup_logger
from onyx.utils.telemetry import optional_telemetry
from onyx.utils.telemetry import RecordType
from shared_configs.configs import MULTI_TENANT
logger = setup_logger()
@@ -87,6 +89,24 @@ LIGHT_SOFT_TIME_LIMIT = 105
LIGHT_TIME_LIMIT = LIGHT_SOFT_TIME_LIMIT + 15
def _get_fence_validation_block_expiration() -> int:
"""
Compute the expiration time for the fence validation block signal.
Base expiration is 300 seconds, multiplied by the beat multiplier only in MULTI_TENANT mode.
"""
base_expiration = 300 # seconds
if not MULTI_TENANT:
return base_expiration
try:
beat_multiplier = OnyxRuntime.get_beat_multiplier()
except Exception:
beat_multiplier = CLOUD_BEAT_MULTIPLIER_DEFAULT
return int(base_expiration * beat_multiplier)
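
The helper keeps the previous fixed 300-second TTL for single-tenant deployments and scales it by the runtime beat multiplier in multi-tenant mode, falling back to CLOUD_BEAT_MULTIPLIER_DEFAULT if the multiplier cannot be read. A tiny worked example (the 2.0 multiplier is illustrative only, not the actual default):

# Illustrative arithmetic only; 2.0 is a made-up multiplier value.
base_expiration = 300
assert int(base_expiration * 1.0) == 300  # single-tenant: base TTL used as-is
assert int(base_expiration * 2.0) == 600  # multi-tenant with a 2.0 beat multiplier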
"""Jobs / utils for kicking off doc permissions sync tasks."""
@@ -194,7 +214,11 @@ def check_for_doc_permissions_sync(self: Task, *, tenant_id: str) -> bool | None
"Exception while validating permission sync fences"
)
r.set(OnyxRedisSignals.BLOCK_VALIDATE_PERMISSION_SYNC_FENCES, 1, ex=300)
r.set(
OnyxRedisSignals.BLOCK_VALIDATE_PERMISSION_SYNC_FENCES,
1,
ex=_get_fence_validation_block_expiration(),
)
# use a lookup table to find active fences. We still have to verify the fence
# exists since it is an optimization and not the source of truth.
@@ -425,6 +449,7 @@ def connector_permission_sync_generator_task(
created = validate_ccpair_for_user(
cc_pair.connector.id,
cc_pair.credential.id,
cc_pair.access_type,
db_session,
enforce_creation=False,
)
@@ -597,91 +622,6 @@ def document_update_permissions(
return True
# NOTE(rkuo): Deprecating this due to degenerate behavior in Redis from sending
# large permissions through celery (over 1MB in size)
# @shared_task(
# name=OnyxCeleryTask.UPDATE_EXTERNAL_DOCUMENT_PERMISSIONS_TASK,
# soft_time_limit=LIGHT_SOFT_TIME_LIMIT,
# time_limit=LIGHT_TIME_LIMIT,
# max_retries=DOCUMENT_PERMISSIONS_UPDATE_MAX_RETRIES,
# bind=True,
# )
# def update_external_document_permissions_task(
# self: Task,
# tenant_id: str,
# serialized_doc_external_access: dict,
# source_string: str,
# connector_id: int,
# credential_id: int,
# ) -> bool:
# start = time.monotonic()
# completion_status = OnyxCeleryTaskCompletionStatus.UNDEFINED
# document_external_access = DocExternalAccess.from_dict(
# serialized_doc_external_access
# )
# doc_id = document_external_access.doc_id
# external_access = document_external_access.external_access
# try:
# with get_session_with_current_tenant() as db_session:
# # Add the users to the DB if they don't exist
# batch_add_ext_perm_user_if_not_exists(
# db_session=db_session,
# emails=list(external_access.external_user_emails),
# continue_on_error=True,
# )
# # Then upsert the document's external permissions
# created_new_doc = upsert_document_external_perms(
# db_session=db_session,
# doc_id=doc_id,
# external_access=external_access,
# source_type=DocumentSource(source_string),
# )
# if created_new_doc:
# # If a new document was created, we associate it with the cc_pair
# upsert_document_by_connector_credential_pair(
# db_session=db_session,
# connector_id=connector_id,
# credential_id=credential_id,
# document_ids=[doc_id],
# )
# elapsed = time.monotonic() - start
# task_logger.info(
# f"connector_id={connector_id} "
# f"doc={doc_id} "
# f"action=update_permissions "
# f"elapsed={elapsed:.2f}"
# )
# completion_status = OnyxCeleryTaskCompletionStatus.SUCCEEDED
# except Exception as e:
# error_msg = format_error_for_logging(e)
# task_logger.warning(
# f"Exception in update_external_document_permissions_task: connector_id={connector_id} doc_id={doc_id} {error_msg}"
# )
# task_logger.exception(
# f"update_external_document_permissions_task exceptioned: "
# f"connector_id={connector_id} doc_id={doc_id}"
# )
# completion_status = OnyxCeleryTaskCompletionStatus.NON_RETRYABLE_EXCEPTION
# finally:
# task_logger.info(
# f"update_external_document_permissions_task completed: status={completion_status.value} doc={doc_id}"
# )
# if completion_status != OnyxCeleryTaskCompletionStatus.SUCCEEDED:
# return False
# task_logger.info(
# f"update_external_document_permissions_task finished: connector_id={connector_id} doc_id={doc_id}"
# )
# return True
def validate_permission_sync_fences(
tenant_id: str,
r: Redis,

View File

@@ -20,7 +20,9 @@ from ee.onyx.background.celery.tasks.external_group_syncing.group_sync_utils imp
from ee.onyx.db.connector_credential_pair import get_all_auto_sync_cc_pairs
from ee.onyx.db.connector_credential_pair import get_cc_pairs_by_source
from ee.onyx.db.external_perm import ExternalUserGroup
from ee.onyx.db.external_perm import replace_user__ext_group_for_cc_pair
from ee.onyx.db.external_perm import mark_old_external_groups_as_stale
from ee.onyx.db.external_perm import remove_stale_external_groups
from ee.onyx.db.external_perm import upsert_external_groups
from ee.onyx.external_permissions.sync_params import (
get_all_cc_pair_agnostic_group_sync_sources,
)
@@ -28,6 +30,7 @@ from ee.onyx.external_permissions.sync_params import get_source_perm_sync_config
from onyx.background.celery.apps.app_base import task_logger
from onyx.background.celery.celery_redis import celery_find_task
from onyx.background.celery.celery_redis import celery_get_unacked_task_ids
from onyx.background.celery.tasks.beat_schedule import CLOUD_BEAT_MULTIPLIER_DEFAULT
from onyx.background.error_logging import emit_background_error
from onyx.configs.app_configs import JOB_TIMEOUT
from onyx.configs.constants import CELERY_EXTERNAL_GROUP_SYNC_LOCK_TIMEOUT
@@ -39,9 +42,8 @@ from onyx.configs.constants import OnyxCeleryTask
from onyx.configs.constants import OnyxRedisConstants
from onyx.configs.constants import OnyxRedisLocks
from onyx.configs.constants import OnyxRedisSignals
from onyx.connectors.exceptions import ConnectorValidationError
from onyx.db.connector_credential_pair import get_connector_credential_pair_from_id
from onyx.db.engine import get_session_with_current_tenant
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.enums import AccessType
from onyx.db.enums import ConnectorCredentialPairStatus
from onyx.db.enums import SyncStatus
@@ -56,19 +58,34 @@ from onyx.redis.redis_connector_ext_group_sync import (
)
from onyx.redis.redis_pool import get_redis_client
from onyx.redis.redis_pool import get_redis_replica_client
from onyx.server.runtime.onyx_runtime import OnyxRuntime
from onyx.server.utils import make_short_id
from onyx.utils.logger import format_error_for_logging
from onyx.utils.logger import setup_logger
from shared_configs.configs import MULTI_TENANT
logger = setup_logger()
EXTERNAL_GROUPS_UPDATE_MAX_RETRIES = 3
_EXTERNAL_GROUP_BATCH_SIZE = 100
# 5 seconds more than RetryDocumentIndex STOP_AFTER+MAX_WAIT
LIGHT_SOFT_TIME_LIMIT = 105
LIGHT_TIME_LIMIT = LIGHT_SOFT_TIME_LIMIT + 15
def _get_fence_validation_block_expiration() -> int:
"""
Compute the expiration time for the fence validation block signal.
Base expiration is 300 seconds, multiplied by the beat multiplier only in MULTI_TENANT mode.
"""
base_expiration = 300 # seconds
if not MULTI_TENANT:
return base_expiration
try:
beat_multiplier = OnyxRuntime.get_beat_multiplier()
except Exception:
beat_multiplier = CLOUD_BEAT_MULTIPLIER_DEFAULT
return int(base_expiration * beat_multiplier)
def _is_external_group_sync_due(cc_pair: ConnectorCredentialPair) -> bool:
@@ -198,7 +215,11 @@ def check_for_external_group_sync(self: Task, *, tenant_id: str) -> bool | None:
"Exception while validating external group sync fences"
)
r.set(OnyxRedisSignals.BLOCK_VALIDATE_EXTERNAL_GROUP_SYNC_FENCES, 1, ex=300)
r.set(
OnyxRedisSignals.BLOCK_VALIDATE_EXTERNAL_GROUP_SYNC_FENCES,
1,
ex=_get_fence_validation_block_expiration(),
)
except SoftTimeLimitExceeded:
task_logger.info(
"Soft time limit exceeded, task is being terminated gracefully."
@@ -377,63 +398,12 @@ def connector_external_group_sync_generator_task(
payload.started = datetime.now(timezone.utc)
redis_connector.external_group_sync.set_fence(payload)
_perform_external_group_sync(
cc_pair_id=cc_pair_id,
tenant_id=tenant_id,
)
with get_session_with_current_tenant() as db_session:
cc_pair = get_connector_credential_pair_from_id(
db_session=db_session,
cc_pair_id=cc_pair_id,
eager_load_credential=True,
)
if cc_pair is None:
raise ValueError(
f"No connector credential pair found for id: {cc_pair_id}"
)
source_type = cc_pair.connector.source
sync_config = get_source_perm_sync_config(source_type)
if sync_config is None:
msg = (
f"No sync config found for {source_type} for cc_pair: {cc_pair_id}"
)
emit_background_error(msg, cc_pair_id=cc_pair_id)
raise ValueError(msg)
if sync_config.group_sync_config is None:
msg = f"No group sync config found for {source_type} for cc_pair: {cc_pair_id}"
emit_background_error(msg, cc_pair_id=cc_pair_id)
raise ValueError(msg)
ext_group_sync_func = sync_config.group_sync_config.group_sync_func
logger.info(
f"Syncing external groups for {source_type} for cc_pair: {cc_pair_id}"
)
external_user_groups: list[ExternalUserGroup] = []
try:
external_user_groups = ext_group_sync_func(tenant_id, cc_pair)
except ConnectorValidationError as e:
# TODO: add some notification to the admins here
logger.exception(
f"Error syncing external groups for {source_type} for cc_pair: {cc_pair_id} {e}"
)
raise e
logger.info(
f"Syncing {len(external_user_groups)} external user groups for {source_type}"
)
logger.debug(f"New external user groups: {external_user_groups}")
replace_user__ext_group_for_cc_pair(
db_session=db_session,
cc_pair_id=cc_pair.id,
group_defs=external_user_groups,
source=cc_pair.connector.source,
)
logger.info(
f"Synced {len(external_user_groups)} external user groups for {source_type}"
)
mark_all_relevant_cc_pairs_as_external_group_synced(db_session, cc_pair)
update_sync_record_status(
db_session=db_session,
entity_id=cc_pair_id,
@@ -475,6 +445,81 @@ def connector_external_group_sync_generator_task(
)
def _perform_external_group_sync(
cc_pair_id: int,
tenant_id: str,
) -> None:
with get_session_with_current_tenant() as db_session:
cc_pair = get_connector_credential_pair_from_id(
db_session=db_session,
cc_pair_id=cc_pair_id,
eager_load_credential=True,
)
if cc_pair is None:
raise ValueError(f"No connector credential pair found for id: {cc_pair_id}")
source_type = cc_pair.connector.source
sync_config = get_source_perm_sync_config(source_type)
if sync_config is None:
msg = f"No sync config found for {source_type} for cc_pair: {cc_pair_id}"
emit_background_error(msg, cc_pair_id=cc_pair_id)
raise ValueError(msg)
if sync_config.group_sync_config is None:
msg = f"No group sync config found for {source_type} for cc_pair: {cc_pair_id}"
emit_background_error(msg, cc_pair_id=cc_pair_id)
raise ValueError(msg)
ext_group_sync_func = sync_config.group_sync_config.group_sync_func
logger.info(
f"Marking old external groups as stale for {source_type} for cc_pair: {cc_pair_id}"
)
mark_old_external_groups_as_stale(db_session, cc_pair_id)
logger.info(
f"Syncing external groups for {source_type} for cc_pair: {cc_pair_id}"
)
external_user_group_batch: list[ExternalUserGroup] = []
try:
external_user_group_generator = ext_group_sync_func(tenant_id, cc_pair)
for external_user_group in external_user_group_generator:
external_user_group_batch.append(external_user_group)
if len(external_user_group_batch) >= _EXTERNAL_GROUP_BATCH_SIZE:
logger.debug(
f"New external user groups: {external_user_group_batch}"
)
upsert_external_groups(
db_session=db_session,
cc_pair_id=cc_pair_id,
external_groups=external_user_group_batch,
source=cc_pair.connector.source,
)
external_user_group_batch = []
if external_user_group_batch:
logger.debug(f"New external user groups: {external_user_group_batch}")
upsert_external_groups(
db_session=db_session,
cc_pair_id=cc_pair_id,
external_groups=external_user_group_batch,
source=cc_pair.connector.source,
)
except Exception as e:
# TODO: add some notification to the admins here
logger.exception(
f"Error syncing external groups for {source_type} for cc_pair: {cc_pair_id} {e}"
)
raise e
logger.info(
f"Removing stale external groups for {source_type} for cc_pair: {cc_pair_id}"
)
remove_stale_external_groups(db_session, cc_pair_id)
mark_all_relevant_cc_pairs_as_external_group_synced(db_session, cc_pair)
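
_perform_external_group_sync consumes the group-sync generator and upserts in batches of _EXTERNAL_GROUP_BATCH_SIZE, flushing whatever remains at the end. The accumulate-and-flush pattern in isolation looks roughly like this (a sketch; `flush_batch` is a stand-in for upsert_external_groups):

from collections.abc import Iterable, Iterator
from itertools import islice

def batched(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield lists of up to batch_size items from a (possibly lazy) iterable."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch

# for batch in batched(ext_group_sync_func(tenant_id, cc_pair), _EXTERNAL_GROUP_BATCH_SIZE):
#     flush_batch(batch)  # stand-in for upsert_external_groups(...)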
def validate_external_group_sync_fences(
tenant_id: str,
celery_app: Celery,

View File

@@ -19,7 +19,7 @@ from onyx.configs.constants import ONYX_CLOUD_TENANT_ID
from onyx.configs.constants import OnyxCeleryQueues
from onyx.configs.constants import OnyxCeleryTask
from onyx.configs.constants import OnyxRedisLocks
from onyx.db.engine import get_session_with_shared_schema
from onyx.db.engine.sql_engine import get_session_with_shared_schema
from onyx.db.models import AvailableTenant
from onyx.redis.redis_pool import get_redis_client
from shared_configs.configs import MULTI_TENANT

View File

@@ -53,6 +53,16 @@ CONFLUENCE_ANONYMOUS_ACCESS_IS_PUBLIC = (
)
#####
# JIRA
#####
# In seconds, default is 30 minutes
JIRA_PERMISSION_DOC_SYNC_FREQUENCY = int(
os.environ.get("JIRA_PERMISSION_DOC_SYNC_FREQUENCY") or 30 * 60
)
#####
# Google Drive
#####
@@ -71,6 +81,15 @@ SLACK_PERMISSION_DOC_SYNC_FREQUENCY = int(
NUM_PERMISSION_WORKERS = int(os.environ.get("NUM_PERMISSION_WORKERS") or 2)
#####
# Teams
#####
# In seconds, default is 5 minutes
TEAMS_PERMISSION_DOC_SYNC_FREQUENCY = int(
os.environ.get("TEAMS_PERMISSION_DOC_SYNC_FREQUENCY") or 5 * 60
)
####
# Celery Job Frequency
####

View File

@@ -0,0 +1,28 @@
from onyx.connectors.confluence.connector import ConfluenceConnector
from onyx.connectors.google_drive.connector import GoogleDriveConnector
from onyx.connectors.interfaces import BaseConnector
def validate_confluence_perm_sync(connector: ConfluenceConnector) -> None:
"""
Validate that the connector is configured correctly for permissions syncing.
"""
def validate_drive_perm_sync(connector: GoogleDriveConnector) -> None:
"""
Validate that the connector is configured correctly for permissions syncing.
"""
def validate_perm_sync(connector: BaseConnector) -> None:
"""
Override this if your connector needs to validate permissions syncing.
Raise an exception if invalid, otherwise do nothing.
Default is a no-op (always successful).
"""
if isinstance(connector, ConfluenceConnector):
validate_confluence_perm_sync(connector)
elif isinstance(connector, GoogleDriveConnector):
validate_drive_perm_sync(connector)
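
validate_perm_sync is a dispatch-by-connector-type hook that stays a no-op for sources without extra requirements. A hedged sketch of how a caller might surface a validation failure (the wrapping error type is an assumption, not what the runner actually does):

# Sketch only: treat a validation failure as a configuration error before
# starting a permission sync.
try:
    validate_perm_sync(connector)  # `connector` built from the cc_pair's config
except Exception as e:
    raise RuntimeError(f"Permission sync validation failed: {e}") from e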

View File

@@ -4,6 +4,7 @@ from uuid import UUID
from pydantic import BaseModel
from sqlalchemy import delete
from sqlalchemy import select
from sqlalchemy import update
from sqlalchemy.orm import Session
from onyx.access.utils import build_ext_group_name_for_onyx
@@ -62,20 +63,41 @@ def delete_public_external_group_for_cc_pair__no_commit(
)
def replace_user__ext_group_for_cc_pair(
def mark_old_external_groups_as_stale(
db_session: Session,
cc_pair_id: int,
group_defs: list[ExternalUserGroup],
) -> None:
db_session.execute(
update(User__ExternalUserGroupId)
.where(User__ExternalUserGroupId.cc_pair_id == cc_pair_id)
.values(stale=True)
)
db_session.execute(
update(PublicExternalUserGroup)
.where(PublicExternalUserGroup.cc_pair_id == cc_pair_id)
.values(stale=True)
)
def upsert_external_groups(
db_session: Session,
cc_pair_id: int,
external_groups: list[ExternalUserGroup],
source: DocumentSource,
) -> None:
"""
This function clears all existing external user group relations for a given cc_pair_id
and replaces them with the new group definitions and commits the changes.
Performs a true upsert operation for external user groups:
- For existing groups (same user_id, external_user_group_id, cc_pair_id), updates the stale flag to False
- For new groups, inserts them with stale=False
- For public groups, uses upsert logic as well
"""
# If there are no groups to add, return early
if not external_groups:
return
# collect all emails from all groups to batch add all users at once for efficiency
all_group_member_emails = set()
for external_group in group_defs:
for external_group in external_groups:
for user_email in external_group.user_emails:
all_group_member_emails.add(user_email)
@@ -86,26 +108,17 @@ def replace_user__ext_group_for_cc_pair(
emails=list(all_group_member_emails),
)
delete_user__ext_group_for_cc_pair__no_commit(
db_session=db_session,
cc_pair_id=cc_pair_id,
)
delete_public_external_group_for_cc_pair__no_commit(
db_session=db_session,
cc_pair_id=cc_pair_id,
)
# map emails to ids
email_id_map = {user.email: user.id for user in all_group_members}
email_id_map = {user.email.lower(): user.id for user in all_group_members}
# use these ids to create new external user group relations relating group_id to user_ids
new_external_permissions: list[User__ExternalUserGroupId] = []
new_public_external_groups: list[PublicExternalUserGroup] = []
for external_group in group_defs:
# Process each external group
for external_group in external_groups:
external_group_id = build_ext_group_name_for_onyx(
ext_group_name=external_group.id,
source=source,
)
# Handle user-group mappings
for user_email in external_group.user_emails:
user_id = email_id_map.get(user_email.lower())
if user_id is None:
@@ -114,24 +127,71 @@ def replace_user__ext_group_for_cc_pair(
f" with email {user_email} not found"
)
continue
new_external_permissions.append(
User__ExternalUserGroupId(
# Check if the user-group mapping already exists
existing_user_group = db_session.scalar(
select(User__ExternalUserGroupId).where(
User__ExternalUserGroupId.user_id == user_id,
User__ExternalUserGroupId.external_user_group_id
== external_group_id,
User__ExternalUserGroupId.cc_pair_id == cc_pair_id,
)
)
if existing_user_group:
# Update existing record
existing_user_group.stale = False
else:
# Insert new record
new_user_group = User__ExternalUserGroupId(
user_id=user_id,
external_user_group_id=external_group_id,
cc_pair_id=cc_pair_id,
stale=False,
)
db_session.add(new_user_group)
# Handle public group if needed
if external_group.gives_anyone_access:
# Check if the public group already exists
existing_public_group = db_session.scalar(
select(PublicExternalUserGroup).where(
PublicExternalUserGroup.external_user_group_id == external_group_id,
PublicExternalUserGroup.cc_pair_id == cc_pair_id,
)
)
if external_group.gives_anyone_access:
new_public_external_groups.append(
PublicExternalUserGroup(
if existing_public_group:
# Update existing record
existing_public_group.stale = False
else:
# Insert new record
new_public_group = PublicExternalUserGroup(
external_user_group_id=external_group_id,
cc_pair_id=cc_pair_id,
stale=False,
)
)
db_session.add(new_public_group)
db_session.add_all(new_external_permissions)
db_session.add_all(new_public_external_groups)
db_session.commit()
def remove_stale_external_groups(
db_session: Session,
cc_pair_id: int,
) -> None:
db_session.execute(
delete(User__ExternalUserGroupId).where(
User__ExternalUserGroupId.cc_pair_id == cc_pair_id,
User__ExternalUserGroupId.stale.is_(True),
)
)
db_session.execute(
delete(PublicExternalUserGroup).where(
PublicExternalUserGroup.cc_pair_id == cc_pair_id,
PublicExternalUserGroup.stale.is_(True),
)
)
db_session.commit()
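
Together with the batched upserts, these helpers give the group sync a mark-and-sweep shape: mark every row stale, re-assert the groups that still exist (clearing their stale flag), then delete whatever stayed stale. A simplified caller-side sketch (batching and error handling omitted; `group_batches` and `source` are placeholders, and the real caller is _perform_external_group_sync):

mark_old_external_groups_as_stale(db_session, cc_pair_id)   # 1. mark every row stale
for batch in group_batches:                                  # 2. upsert fresh groups (stale=False)
    upsert_external_groups(
        db_session=db_session,
        cc_pair_id=cc_pair_id,
        external_groups=batch,
        source=source,
    )
remove_stale_external_groups(db_session, cc_pair_id)          # 3. delete rows still marked stale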

View File

@@ -115,11 +115,24 @@ def get_all_usage_reports(db_session: Session) -> list[UsageReportMetadata]:
def get_usage_report_data(
db_session: Session,
report_name: str,
report_display_name: str,
) -> IO:
"""
Get the usage report data from the file store.
Args:
db_session: The database session.
report_display_name: The display name of the usage report. Also assumes
that the file is stored with this as the ID in the file store.
Returns:
The usage report data.
"""
file_store = get_default_file_store(db_session)
# usage report may be very large, so don't load it all into memory
return file_store.read_file(file_name=report_name, mode="b", use_tempfile=True)
return file_store.read_file(
file_id=report_display_name, mode="b", use_tempfile=True
)
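
This is one instance of the broader file-store API change in this compare: save_file, read_file, and has_file are now keyed by file_id instead of file_name. A rough round trip under the new keywords (identifiers and the origin/type values below are placeholders):

# Sketch of the new file_id-based keywords (placeholder values throughout).
store = get_default_file_store(db_session)
store.save_file(
    file_id="usage_report_example",
    content=report_stream,
    display_name="usage_report_example",
    file_origin=FileOrigin.OTHER,
    file_type="text/csv",
)
data = store.read_file(file_id="usage_report_example", mode="b", use_tempfile=True)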
def write_usage_report(

View File

@@ -6,11 +6,11 @@ https://confluence.atlassian.com/conf85/check-who-can-view-a-page-1283360557.htm
from collections.abc import Generator
from ee.onyx.external_permissions.perm_sync_types import FetchAllDocumentsFunction
from ee.onyx.external_permissions.utils import generic_doc_sync
from onyx.access.models import DocExternalAccess
from onyx.access.models import ExternalAccess
from onyx.configs.constants import DocumentSource
from onyx.connectors.confluence.connector import ConfluenceConnector
from onyx.connectors.credentials_provider import OnyxDBCredentialsProvider
from onyx.connectors.models import SlimDocument
from onyx.db.models import ConnectorCredentialPair
from onyx.indexing.indexing_heartbeat import IndexingHeartbeatInterface
from onyx.utils.logger import setup_logger
@@ -19,6 +19,9 @@ from shared_configs.contextvars import get_current_tenant_id
logger = setup_logger()
CONFLUENCE_DOC_SYNC_LABEL = "confluence_doc_sync"
def confluence_doc_sync(
cc_pair: ConnectorCredentialPair,
fetch_all_existing_docs_fn: FetchAllDocumentsFunction,
@@ -29,7 +32,6 @@ def confluence_doc_sync(
Compares fetched documents against existing documents in the DB for the connector.
If a document exists in the DB but not in the Confluence fetch, it's marked as restricted.
"""
logger.info(f"Starting confluence doc sync for CC Pair ID: {cc_pair.id}")
confluence_connector = ConfluenceConnector(
**cc_pair.connector.connector_specific_config
)
@@ -39,52 +41,11 @@ def confluence_doc_sync(
)
confluence_connector.set_credentials_provider(provider)
slim_docs: list[SlimDocument] = []
logger.info("Fetching all slim documents from confluence")
for doc_batch in confluence_connector.retrieve_all_slim_documents(
callback=callback
):
logger.info(f"Got {len(doc_batch)} slim documents from confluence")
if callback:
if callback.should_stop():
raise RuntimeError("confluence_doc_sync: Stop signal detected")
callback.progress("confluence_doc_sync", 1)
slim_docs.extend(doc_batch)
# Find documents that are no longer accessible in Confluence
logger.info(f"Querying existing document IDs for CC Pair ID: {cc_pair.id}")
existing_doc_ids = fetch_all_existing_docs_fn()
# Find missing doc IDs
fetched_doc_ids = {doc.id for doc in slim_docs}
missing_doc_ids = set(existing_doc_ids) - fetched_doc_ids
# Yield access removal for missing docs. Better to be safe.
if missing_doc_ids:
logger.warning(
f"Found {len(missing_doc_ids)} documents that are in the DB but "
"not present in Confluence fetch. Making them inaccessible."
)
for missing_id in missing_doc_ids:
logger.warning(f"Removing access for document ID: {missing_id}")
yield DocExternalAccess(
doc_id=missing_id,
external_access=ExternalAccess(
external_user_emails=set(),
external_user_group_ids=set(),
is_public=False,
),
)
for doc in slim_docs:
if not doc.external_access:
raise RuntimeError(f"No external access found for document ID: {doc.id}")
yield DocExternalAccess(
doc_id=doc.id,
external_access=doc.external_access,
)
logger.info("Finished confluence doc sync")
yield from generic_doc_sync(
cc_pair=cc_pair,
fetch_all_existing_docs_fn=fetch_all_existing_docs_fn,
callback=callback,
doc_source=DocumentSource.CONFLUENCE,
slim_connector=confluence_connector,
label=CONFLUENCE_DOC_SYNC_LABEL,
)

View File

@@ -1,3 +1,5 @@
from collections.abc import Generator
from ee.onyx.db.external_perm import ExternalUserGroup
from ee.onyx.external_permissions.confluence.constants import ALL_CONF_EMAILS_GROUP_NAME
from onyx.background.error_logging import emit_background_error
@@ -65,7 +67,7 @@ def _build_group_member_email_map(
def confluence_group_sync(
tenant_id: str,
cc_pair: ConnectorCredentialPair,
) -> list[ExternalUserGroup]:
) -> Generator[ExternalUserGroup, None, None]:
provider = OnyxDBCredentialsProvider(tenant_id, "confluence", cc_pair.credential_id)
is_cloud = cc_pair.connector.connector_specific_config.get("is_cloud", False)
wiki_base: str = cc_pair.connector.connector_specific_config["wiki_base"]
@@ -89,10 +91,10 @@ def confluence_group_sync(
confluence_client=confluence_client,
cc_pair_id=cc_pair.id,
)
onyx_groups: list[ExternalUserGroup] = []
all_found_emails = set()
for group_id, group_member_emails in group_member_email_map.items():
onyx_groups.append(
yield (
ExternalUserGroup(
id=group_id,
user_emails=list(group_member_emails),
@@ -107,6 +109,4 @@ def confluence_group_sync(
id=ALL_CONF_EMAILS_GROUP_NAME,
user_emails=list(all_found_emails),
)
onyx_groups.append(all_found_group)
return onyx_groups
yield all_found_group

View File

@@ -1,3 +1,5 @@
from collections.abc import Generator
from googleapiclient.errors import HttpError # type: ignore
from pydantic import BaseModel
@@ -99,6 +101,44 @@ def _get_all_folders(
return all_folders
def _drive_folder_to_onyx_group(
folder: FolderInfo,
group_email_to_member_emails_map: dict[str, list[str]],
) -> ExternalUserGroup:
"""
Converts a folder into an Onyx group.
"""
anyone_can_access = False
folder_member_emails: set[str] = set()
for permission in folder.permissions:
if permission.type == PermissionType.USER:
if permission.email_address is None:
logger.warning(
f"User email is None for folder {folder.id} permission {permission}"
)
continue
folder_member_emails.add(permission.email_address)
elif permission.type == PermissionType.GROUP:
if permission.email_address not in group_email_to_member_emails_map:
logger.warning(
f"Group email {permission.email_address} for folder {folder.id} "
"not found in group_email_to_member_emails_map"
)
continue
folder_member_emails.update(
group_email_to_member_emails_map[permission.email_address]
)
elif permission.type == PermissionType.ANYONE:
anyone_can_access = True
return ExternalUserGroup(
id=folder.id,
user_emails=list(folder_member_emails),
gives_anyone_access=anyone_can_access,
)
"""Individual Shared Drive / My Drive Permission Sync"""
@@ -167,7 +207,29 @@ def _get_drive_members(
return drive_id_to_members_map
def _get_all_groups(
def _drive_member_map_to_onyx_groups(
drive_id_to_members_map: dict[str, tuple[set[str], set[str]]],
group_email_to_member_emails_map: dict[str, list[str]],
) -> Generator[ExternalUserGroup, None, None]:
"""The `user_emails` for the Shared Drive should be all individuals in the
Shared Drive + the union of all flattened group emails."""
for drive_id, (group_emails, user_emails) in drive_id_to_members_map.items():
drive_member_emails: set[str] = user_emails
for group_email in group_emails:
if group_email not in group_email_to_member_emails_map:
logger.warning(
f"Group email {group_email} for drive {drive_id} not found in "
"group_email_to_member_emails_map"
)
continue
drive_member_emails.update(group_email_to_member_emails_map[group_email])
yield ExternalUserGroup(
id=drive_id,
user_emails=list(drive_member_emails),
)
def _get_all_google_groups(
admin_service: AdminService,
google_domain: str,
) -> set[str]:
@@ -185,6 +247,28 @@ def _get_all_groups(
return group_emails
def _google_group_to_onyx_group(
admin_service: AdminService,
group_email: str,
) -> ExternalUserGroup:
"""
This maps google group emails to their member emails.
"""
group_member_emails: set[str] = set()
for member in execute_paginated_retrieval(
admin_service.members().list,
list_key="members",
groupKey=group_email,
fields="members(email),nextPageToken",
):
group_member_emails.add(member["email"])
return ExternalUserGroup(
id=group_email,
user_emails=list(group_member_emails),
)
def _map_group_email_to_member_emails(
admin_service: AdminService,
group_emails: set[str],
@@ -282,7 +366,7 @@ def _build_onyx_groups(
def gdrive_group_sync(
tenant_id: str,
cc_pair: ConnectorCredentialPair,
) -> list[ExternalUserGroup]:
) -> Generator[ExternalUserGroup, None, None]:
# Initialize connector and build credential/service objects
google_drive_connector = GoogleDriveConnector(
**cc_pair.connector.connector_specific_config
@@ -296,26 +380,27 @@ def gdrive_group_sync(
drive_id_to_members_map = _get_drive_members(google_drive_connector, admin_service)
# Get all group emails
all_group_emails = _get_all_groups(
all_group_emails = _get_all_google_groups(
admin_service, google_drive_connector.google_domain
)
# Each google group is an Onyx group, yield those
group_email_to_member_emails_map: dict[str, list[str]] = {}
for group_email in all_group_emails:
onyx_group = _google_group_to_onyx_group(admin_service, group_email)
group_email_to_member_emails_map[group_email] = onyx_group.user_emails
yield onyx_group
# Each drive is a group, yield those
for onyx_group in _drive_member_map_to_onyx_groups(
drive_id_to_members_map, group_email_to_member_emails_map
):
yield onyx_group
# Get all folder permissions
folder_info = _get_all_folders(
google_drive_connector=google_drive_connector,
skip_folders_without_permissions=True,
)
# Map group emails to their members
group_email_to_member_emails_map = _map_group_email_to_member_emails(
admin_service, all_group_emails
)
# Convert the maps to onyx groups
onyx_groups = _build_onyx_groups(
drive_id_to_members_map=drive_id_to_members_map,
group_email_to_member_emails_map=group_email_to_member_emails_map,
folder_info=folder_info,
)
return onyx_groups
for folder in folder_info:
yield _drive_folder_to_onyx_group(folder, group_email_to_member_emails_map)

View File

@@ -0,0 +1,34 @@
from collections.abc import Generator
from ee.onyx.external_permissions.perm_sync_types import FetchAllDocumentsFunction
from ee.onyx.external_permissions.utils import generic_doc_sync
from onyx.access.models import DocExternalAccess
from onyx.configs.constants import DocumentSource
from onyx.connectors.jira.connector import JiraConnector
from onyx.db.models import ConnectorCredentialPair
from onyx.indexing.indexing_heartbeat import IndexingHeartbeatInterface
from onyx.utils.logger import setup_logger
logger = setup_logger()
JIRA_DOC_SYNC_TAG = "jira_doc_sync"
def jira_doc_sync(
cc_pair: ConnectorCredentialPair,
fetch_all_existing_docs_fn: FetchAllDocumentsFunction,
callback: IndexingHeartbeatInterface | None = None,
) -> Generator[DocExternalAccess, None, None]:
jira_connector = JiraConnector(
**cc_pair.connector.connector_specific_config,
)
jira_connector.load_credentials(cc_pair.credential.credential_json)
yield from generic_doc_sync(
cc_pair=cc_pair,
fetch_all_existing_docs_fn=fetch_all_existing_docs_fn,
callback=callback,
doc_source=DocumentSource.JIRA,
slim_connector=jira_connector,
label=JIRA_DOC_SYNC_TAG,
)

View File

@@ -0,0 +1,25 @@
from typing import Any
from pydantic import BaseModel
from pydantic import ConfigDict
from pydantic.alias_generators import to_camel
Holder = dict[str, Any]
class Permission(BaseModel):
id: int
permission: str
holder: Holder | None
class User(BaseModel):
account_id: str
email_address: str
display_name: str
active: bool
model_config = ConfigDict(
alias_generator=to_camel,
)

View File

@@ -0,0 +1,209 @@
from collections import defaultdict
from jira import JIRA
from jira.resources import PermissionScheme
from pydantic import ValidationError
from ee.onyx.external_permissions.jira.models import Holder
from ee.onyx.external_permissions.jira.models import Permission
from ee.onyx.external_permissions.jira.models import User
from onyx.access.models import ExternalAccess
from onyx.utils.logger import setup_logger
HolderMap = dict[str, list[Holder]]
logger = setup_logger()
def _build_holder_map(permissions: list[dict]) -> dict[str, list[Holder]]:
"""
A "Holder" in JIRA is a person / entity who "holds" the corresponding permission.
It can have different types. They can be one of (but not limited to):
- user (an explicitly whitelisted user)
- projectRole (for project level "roles")
- reporter (the reporter of an issue)
A "Holder" usually has following structure:
- `{ "type": "user", "value": "$USER_ID", "user": { .. }, .. }`
- `{ "type": "projectRole", "value": "$PROJECT_ID", .. }`
When we fetch the PermissionSchema from JIRA, we retrieve a list of "Holder"s.
The list of "Holder"s can have multiple "Holder"s of the same type in the list (e.g., you can have two `"type": "user"`s in
there, each corresponding to a different user).
This function constructs a map of "Holder" types to a list of the "Holder"s which contained that type.
Returns:
A dict from the "Holder" type to the actual "Holder" instance.
Example:
```
{
"user": [
{ "type": "user", "value": "10000", "user": { .. }, .. },
{ "type": "user", "value": "10001", "user": { .. }, .. },
],
"projectRole": [
{ "type": "projectRole", "value": "10010", .. },
{ "type": "projectRole", "value": "10011", .. },
],
"applicationRole": [
{ "type": "applicationRole" },
],
..
}
```
"""
holder_map: defaultdict[str, list[Holder]] = defaultdict(list)
for raw_perm in permissions:
if not hasattr(raw_perm, "raw"):
logger.warn(f"Expected a 'raw' field, but none was found: {raw_perm=}")
continue
permission = Permission(**raw_perm.raw)
# We only care about ability to browse through projects + issues (not other permissions such as read/write).
if permission.permission != "BROWSE_PROJECTS":
continue
# In order to associate this permission to some Atlassian entity, we need the "Holder".
# If this doesn't exist, then we cannot associate this permission to anyone; just skip.
if not permission.holder:
logger.warn(
f"Expected to find a permission holder, but none was found: {permission=}"
)
continue
type = permission.holder.get("type")
if not type:
logger.warn(
f"Expected to find the type of permission holder, but none was found: {permission=}"
)
continue
holder_map[type].append(permission.holder)
return holder_map
def _get_user_emails(user_holders: list[Holder]) -> list[str]:
emails = []
for user_holder in user_holders:
if "user" not in user_holder:
continue
raw_user_dict = user_holder["user"]
try:
user_model = User.model_validate(raw_user_dict)
except ValidationError:
logger.error(
"Expected to be able to serialize the raw-user-dict into an instance of `User`, but validation failed;"
f"{raw_user_dict=}"
)
continue
emails.append(user_model.email_address)
return emails
def _get_user_emails_from_project_roles(
jira_client: JIRA,
jira_project: str,
project_role_holders: list[Holder],
) -> list[str]:
# NOTE (@raunakab) a `parallel_yield` may be helpful here...?
roles = [
jira_client.project_role(project=jira_project, id=project_role_holder["value"])
for project_role_holder in project_role_holders
if "value" in project_role_holder
]
emails = []
for role in roles:
if not hasattr(role, "actors"):
continue
for actor in role.actors:
if not hasattr(actor, "actorUser") or not hasattr(
actor.actorUser, "accountId"
):
continue
user = jira_client.user(id=actor.actorUser.accountId)
if not hasattr(user, "accountType") or user.accountType != "atlassian":
continue
if not hasattr(user, "emailAddress"):
msg = f"User's email address was not able to be retrieved; {actor.actorUser.accountId=}"
if hasattr(user, "displayName"):
msg += f" {actor.displayName=}"
logger.warn(msg)
continue
emails.append(user.emailAddress)
return emails
def _build_external_access_from_holder_map(
jira_client: JIRA, jira_project: str, holder_map: HolderMap
) -> ExternalAccess:
"""
# Note:
If the `holder_map` contains an instance of "anyone", then this is a public JIRA project.
Otherwise, we fetch the "projectRole"s (i.e., the user-groups in JIRA speak), and the user emails.
"""
if "anyone" in holder_map:
return ExternalAccess(
external_user_emails=set(), external_user_group_ids=set(), is_public=True
)
user_emails = (
_get_user_emails(user_holders=holder_map["user"])
if "user" in holder_map
else []
)
project_role_user_emails = (
_get_user_emails_from_project_roles(
jira_client=jira_client,
jira_project=jira_project,
project_role_holders=holder_map["projectRole"],
)
if "projectRole" in holder_map
else []
)
external_user_emails = set(user_emails + project_role_user_emails)
return ExternalAccess(
external_user_emails=external_user_emails,
external_user_group_ids=set(),
is_public=False,
)
def get_project_permissions(
jira_client: JIRA,
jira_project: str,
) -> ExternalAccess | None:
project_permissions: PermissionScheme = jira_client.project_permissionscheme(
project=jira_project
)
if not hasattr(project_permissions, "permissions"):
return None
if not isinstance(project_permissions.permissions, list):
return None
holder_map = _build_holder_map(permissions=project_permissions.permissions)
return _build_external_access_from_holder_map(
jira_client=jira_client, jira_project=jira_project, holder_map=holder_map
)
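
get_project_permissions collapses a project's permission scheme into a single ExternalAccess, and an "anyone" holder short-circuits straight to a public grant. A small illustration of that branch (the holder data and project key are made up):

# Made-up holder data illustrating the "anyone" short-circuit.
holder_map: HolderMap = {"anyone": [{"type": "anyone"}]}
access = _build_external_access_from_holder_map(
    jira_client=jira_client,  # assumed authenticated JIRA client
    jira_project="PROJ",      # placeholder project key
    holder_map=holder_map,
)
assert access.is_public
assert not access.external_user_emails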

View File

@@ -39,10 +39,10 @@ DocSyncFuncType = Callable[
GroupSyncFuncType = Callable[
[
str,
"ConnectorCredentialPair",
str, # tenant_id
"ConnectorCredentialPair", # cc_pair
],
list["ExternalUserGroup"],
Generator["ExternalUserGroup", None, None],
]
# list of chunks to be censored and the user email. returns censored chunks

View File

@@ -3,7 +3,7 @@ from ee.onyx.external_permissions.sync_params import get_all_censoring_enabled_s
from ee.onyx.external_permissions.sync_params import get_source_perm_sync_config
from onyx.configs.constants import DocumentSource
from onyx.context.search.pipeline import InferenceChunk
from onyx.db.engine import get_session_context_manager
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.models import User
from onyx.utils.logger import setup_logger
@@ -22,7 +22,7 @@ def _get_all_censoring_enabled_sources() -> set[DocumentSource]:
for every single chunk.
"""
all_censoring_enabled_sources = get_all_censoring_enabled_sources()
with get_session_context_manager() as db_session:
with get_session_with_current_tenant() as db_session:
enabled_sync_connectors = get_all_auto_sync_cc_pairs(db_session)
return {
cc_pair.connector.source

View File

@@ -10,7 +10,7 @@ from ee.onyx.external_permissions.salesforce.utils import (
)
from onyx.configs.app_configs import BLURB_SIZE
from onyx.context.search.models import InferenceChunk
from onyx.db.engine import get_session_context_manager
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.utils.logger import setup_logger
logger = setup_logger()
@@ -44,7 +44,7 @@ def _get_objects_access_for_user_email_from_salesforce(
# This is cached in the function so the first query takes an extra 0.1-0.3 seconds
# but subsequent queries for this source are essentially instant
first_doc_id = chunks[0].document_id
with get_session_context_manager() as db_session:
with get_session_with_current_tenant() as db_session:
salesforce_client = get_any_salesforce_client_for_doc_id(
db_session, first_doc_id
)
@@ -217,7 +217,7 @@ def censor_salesforce_chunks(
def _get_objects_access_for_user_email(
object_ids: set[str], user_email: str
) -> dict[str, bool]:
with get_session_context_manager() as db_session:
with get_session_with_current_tenant() as db_session:
external_groups = fetch_external_groups_for_user_email_and_group_ids(
db_session=db_session,
user_email=user_email,

View File

@@ -8,12 +8,15 @@ from ee.onyx.configs.app_configs import CONFLUENCE_PERMISSION_DOC_SYNC_FREQUENCY
from ee.onyx.configs.app_configs import CONFLUENCE_PERMISSION_GROUP_SYNC_FREQUENCY
from ee.onyx.configs.app_configs import DEFAULT_PERMISSION_DOC_SYNC_FREQUENCY
from ee.onyx.configs.app_configs import GOOGLE_DRIVE_PERMISSION_GROUP_SYNC_FREQUENCY
from ee.onyx.configs.app_configs import JIRA_PERMISSION_DOC_SYNC_FREQUENCY
from ee.onyx.configs.app_configs import SLACK_PERMISSION_DOC_SYNC_FREQUENCY
from ee.onyx.configs.app_configs import TEAMS_PERMISSION_DOC_SYNC_FREQUENCY
from ee.onyx.external_permissions.confluence.doc_sync import confluence_doc_sync
from ee.onyx.external_permissions.confluence.group_sync import confluence_group_sync
from ee.onyx.external_permissions.gmail.doc_sync import gmail_doc_sync
from ee.onyx.external_permissions.google_drive.doc_sync import gdrive_doc_sync
from ee.onyx.external_permissions.google_drive.group_sync import gdrive_group_sync
from ee.onyx.external_permissions.jira.doc_sync import jira_doc_sync
from ee.onyx.external_permissions.perm_sync_types import CensoringFuncType
from ee.onyx.external_permissions.perm_sync_types import DocSyncFuncType
from ee.onyx.external_permissions.perm_sync_types import FetchAllDocumentsFunction
@@ -22,6 +25,7 @@ from ee.onyx.external_permissions.salesforce.postprocessing import (
censor_salesforce_chunks,
)
from ee.onyx.external_permissions.slack.doc_sync import slack_doc_sync
from ee.onyx.external_permissions.teams.doc_sync import teams_doc_sync
from onyx.configs.constants import DocumentSource
if TYPE_CHECKING:
@@ -90,15 +94,21 @@ _SOURCE_TO_SYNC_CONFIG: dict[DocumentSource, SyncConfig] = {
group_sync_is_cc_pair_agnostic=True,
),
),
DocumentSource.JIRA: SyncConfig(
doc_sync_config=DocSyncConfig(
doc_sync_frequency=JIRA_PERMISSION_DOC_SYNC_FREQUENCY,
doc_sync_func=jira_doc_sync,
initial_index_should_sync=True,
),
),
# Groups are not needed for Slack.
# All channel access is done at the individual user level.
DocumentSource.SLACK: SyncConfig(
doc_sync_config=DocSyncConfig(
doc_sync_frequency=SLACK_PERMISSION_DOC_SYNC_FREQUENCY,
doc_sync_func=slack_doc_sync,
initial_index_should_sync=True,
),
# groups are not needed for Slack. All channel access is done at the
# individual user level
group_sync_config=None,
),
DocumentSource.GMAIL: SyncConfig(
doc_sync_config=DocSyncConfig(
@@ -119,6 +129,15 @@ _SOURCE_TO_SYNC_CONFIG: dict[DocumentSource, SyncConfig] = {
initial_index_should_sync=True,
),
),
# Groups are not needed for Teams.
# All channel access is done at the individual user level.
DocumentSource.TEAMS: SyncConfig(
doc_sync_config=DocSyncConfig(
doc_sync_frequency=TEAMS_PERMISSION_DOC_SYNC_FREQUENCY,
doc_sync_func=teams_doc_sync,
initial_index_should_sync=True,
),
),
}

View File

@@ -0,0 +1,35 @@
from collections.abc import Generator
from ee.onyx.external_permissions.perm_sync_types import FetchAllDocumentsFunction
from ee.onyx.external_permissions.utils import generic_doc_sync
from onyx.access.models import DocExternalAccess
from onyx.configs.constants import DocumentSource
from onyx.connectors.teams.connector import TeamsConnector
from onyx.db.models import ConnectorCredentialPair
from onyx.indexing.indexing_heartbeat import IndexingHeartbeatInterface
from onyx.utils.logger import setup_logger
logger = setup_logger()
TEAMS_DOC_SYNC_LABEL = "teams_doc_sync"
def teams_doc_sync(
cc_pair: ConnectorCredentialPair,
fetch_all_existing_docs_fn: FetchAllDocumentsFunction,
callback: IndexingHeartbeatInterface | None,
) -> Generator[DocExternalAccess, None, None]:
teams_connector = TeamsConnector(
**cc_pair.connector.connector_specific_config,
)
teams_connector.load_credentials(cc_pair.credential.credential_json)
yield from generic_doc_sync(
cc_pair=cc_pair,
fetch_all_existing_docs_fn=fetch_all_existing_docs_fn,
callback=callback,
doc_source=DocumentSource.TEAMS,
slim_connector=teams_connector,
label=TEAMS_DOC_SYNC_LABEL,
)

View File

@@ -0,0 +1,83 @@
from collections.abc import Generator
from ee.onyx.external_permissions.perm_sync_types import FetchAllDocumentsFunction
from onyx.access.models import DocExternalAccess
from onyx.access.models import ExternalAccess
from onyx.configs.constants import DocumentSource
from onyx.connectors.interfaces import SlimConnector
from onyx.db.models import ConnectorCredentialPair
from onyx.indexing.indexing_heartbeat import IndexingHeartbeatInterface
from onyx.utils.logger import setup_logger
logger = setup_logger()
def generic_doc_sync(
cc_pair: ConnectorCredentialPair,
fetch_all_existing_docs_fn: FetchAllDocumentsFunction,
callback: IndexingHeartbeatInterface | None,
doc_source: DocumentSource,
slim_connector: SlimConnector,
label: str,
) -> Generator[DocExternalAccess, None, None]:
"""
A convenience function for performing a generic document synchronization.
Notes:
A generic doc sync includes:
- fetching existing docs
- fetching *all* new (slim) docs
- yielding external-access permissions for existing docs which do not exist in the newly fetched slim-docs set (with their
`external_access` set to "private")
- yielding external-access permissions for newly fetched docs
Returns:
A `Generator` which yields existing and newly fetched external-access permissions.
"""
logger.info(f"Starting {doc_source} doc sync for CC Pair ID: {cc_pair.id}")
newly_fetched_doc_ids: set[str] = set()
logger.info(f"Fetching all slim documents from {doc_source}")
for doc_batch in slim_connector.retrieve_all_slim_documents(callback=callback):
logger.info(f"Got {len(doc_batch)} slim documents from {doc_source}")
if callback:
if callback.should_stop():
raise RuntimeError(f"{label}: Stop signal detected")
callback.progress(label, 1)
for doc in doc_batch:
if not doc.external_access:
raise RuntimeError(
f"No external access found for document ID; {cc_pair.id=} {doc_source=} {doc.id=}"
)
newly_fetched_doc_ids.add(doc.id)
yield DocExternalAccess(
doc_id=doc.id,
external_access=doc.external_access,
)
logger.info(f"Querying existing document IDs for CC Pair ID: {cc_pair.id=}")
existing_doc_ids = set(fetch_all_existing_docs_fn())
missing_doc_ids = existing_doc_ids - newly_fetched_doc_ids
if not missing_doc_ids:
return
logger.warning(
f"Found {len(missing_doc_ids)=} documents that are in the DB but not present in fetch. Making them inaccessible."
)
for missing_id in missing_doc_ids:
logger.warning(f"Removing access for {missing_id=}")
yield DocExternalAccess(
doc_id=missing_id,
external_access=ExternalAccess.empty(),
)
logger.info(f"Finished {doc_source} doc sync")

View File

@@ -19,7 +19,7 @@ from ee.onyx.db.analytics import fetch_query_analytics
from ee.onyx.db.analytics import user_can_view_assistant_stats
from onyx.auth.users import current_admin_user
from onyx.auth.users import current_user
from onyx.db.engine import get_session
from onyx.db.engine.sql_engine import get_session
from onyx.db.models import User
router = APIRouter(prefix="/analytics")

View File

@@ -17,7 +17,7 @@ from onyx.background.celery.versioned_apps.client import app as client_app
from onyx.db.connector_credential_pair import (
get_connector_credential_pair_from_id_for_user,
)
from onyx.db.engine import get_session
from onyx.db.engine.sql_engine import get_session
from onyx.db.models import User
from onyx.redis.redis_connector import RedisConnector
from onyx.redis.redis_pool import get_redis_client

View File

@@ -26,9 +26,9 @@ from onyx.auth.users import current_admin_user
from onyx.auth.users import current_user_with_expired_token
from onyx.auth.users import get_user_manager
from onyx.auth.users import UserManager
from onyx.db.engine import get_session
from onyx.db.engine.sql_engine import get_session
from onyx.db.models import User
from onyx.file_store.file_store import PostgresBackedFileStore
from onyx.file_store.file_store import get_default_file_store
from onyx.server.utils import BasicAuthenticationError
from onyx.utils.logger import setup_logger
from shared_configs.configs import MULTI_TENANT
@@ -142,11 +142,12 @@ def put_logo(
def fetch_logo_helper(db_session: Session) -> Response:
try:
file_store = PostgresBackedFileStore(db_session)
file_store = get_default_file_store(db_session)
onyx_file = file_store.get_file_with_mime_type(get_logo_filename())
if not onyx_file:
raise ValueError("get_onyx_file returned None!")
except Exception:
logger.exception("Faield to fetch logo file")
raise HTTPException(
status_code=404,
detail="No logo file found",
@@ -157,7 +158,7 @@ def fetch_logo_helper(db_session: Session) -> Response:
def fetch_logotype_helper(db_session: Session) -> Response:
try:
file_store = PostgresBackedFileStore(db_session)
file_store = get_default_file_store(db_session)
onyx_file = file_store.get_file_with_mime_type(get_logotype_filename())
if not onyx_file:
raise ValueError("get_onyx_file returned None!")

View File

@@ -131,11 +131,11 @@ def upload_logo(
file_store = get_default_file_store(db_session)
file_store.save_file(
file_name=_LOGOTYPE_FILENAME if is_logotype else _LOGO_FILENAME,
content=content,
display_name=display_name,
file_origin=FileOrigin.OTHER,
file_type=file_type,
file_id=_LOGOTYPE_FILENAME if is_logotype else _LOGO_FILENAME,
)
return True

View File

@@ -13,7 +13,7 @@ from ee.onyx.db.standard_answer import remove_standard_answer
from ee.onyx.db.standard_answer import update_standard_answer
from ee.onyx.db.standard_answer import update_standard_answer_category
from onyx.auth.users import current_admin_user
from onyx.db.engine import get_session
from onyx.db.engine.sql_engine import get_session
from onyx.db.models import User
from onyx.server.manage.models import StandardAnswer
from onyx.server.manage.models import StandardAnswerCategory

View File

@@ -11,7 +11,7 @@ from ee.onyx.auth.users import decode_anonymous_user_jwt_token
from onyx.auth.api_key import extract_tenant_from_api_key_header
from onyx.configs.constants import ANONYMOUS_USER_COOKIE_NAME
from onyx.configs.constants import TENANT_ID_COOKIE_NAME
from onyx.db.engine import is_valid_schema_name
from onyx.db.engine.sql_engine import is_valid_schema_name
from onyx.redis.redis_pool import retrieve_auth_token_data_from_redis
from shared_configs.configs import MULTI_TENANT
from shared_configs.configs import POSTGRES_DEFAULT_SCHEMA

View File

@@ -12,10 +12,10 @@ from ee.onyx.server.oauth.slack import SlackOAuth
from onyx.auth.users import current_admin_user
from onyx.configs.app_configs import DEV_MODE
from onyx.configs.constants import DocumentSource
from onyx.db.engine import get_current_tenant_id
from onyx.db.models import User
from onyx.redis.redis_pool import get_redis_client
from onyx.utils.logger import setup_logger
from shared_configs.contextvars import get_current_tenant_id
logger = setup_logger()

View File

@@ -25,12 +25,12 @@ from onyx.connectors.confluence.utils import CONFLUENCE_OAUTH_TOKEN_URL
from onyx.db.credentials import create_credential
from onyx.db.credentials import fetch_credential_by_id_for_user
from onyx.db.credentials import update_credential_json
from onyx.db.engine import get_current_tenant_id
from onyx.db.engine import get_session
from onyx.db.engine.sql_engine import get_session
from onyx.db.models import User
from onyx.redis.redis_pool import get_redis_client
from onyx.server.documents.models import CredentialBase
from onyx.utils.logger import setup_logger
from shared_configs.contextvars import get_current_tenant_id
logger = setup_logger()

View File

@@ -33,11 +33,11 @@ from onyx.connectors.google_utils.shared_constants import (
GoogleOAuthAuthenticationMethod,
)
from onyx.db.credentials import create_credential
from onyx.db.engine import get_current_tenant_id
from onyx.db.engine import get_session
from onyx.db.engine.sql_engine import get_session
from onyx.db.models import User
from onyx.redis.redis_pool import get_redis_client
from onyx.server.documents.models import CredentialBase
from shared_configs.contextvars import get_current_tenant_id
class GoogleDriveOAuth:

View File

@@ -17,11 +17,11 @@ from onyx.configs.app_configs import OAUTH_SLACK_CLIENT_SECRET
from onyx.configs.app_configs import WEB_DOMAIN
from onyx.configs.constants import DocumentSource
from onyx.db.credentials import create_credential
from onyx.db.engine import get_current_tenant_id
from onyx.db.engine import get_session
from onyx.db.engine.sql_engine import get_session
from onyx.db.models import User
from onyx.redis.redis_pool import get_redis_client
from onyx.server.documents.models import CredentialBase
from shared_configs.contextvars import get_current_tenant_id
class SlackOAuth:

View File

@@ -40,7 +40,7 @@ from onyx.context.search.models import SavedSearchDoc
from onyx.db.chat import create_chat_session
from onyx.db.chat import create_new_chat_message
from onyx.db.chat import get_or_create_root_message
from onyx.db.engine import get_session
from onyx.db.engine.sql_engine import get_session
from onyx.db.models import User
from onyx.llm.factory import get_llms_for_persona
from onyx.natural_language_processing.utils import get_tokenizer

View File

@@ -31,7 +31,7 @@ from onyx.context.search.utils import dedupe_documents
from onyx.context.search.utils import drop_llm_indices
from onyx.context.search.utils import relevant_sections_to_indices
from onyx.db.chat import get_prompt_by_id
from onyx.db.engine import get_session
from onyx.db.engine.sql_engine import get_session
from onyx.db.models import Persona
from onyx.db.models import User
from onyx.db.persona import get_persona_by_id

View File

@@ -13,7 +13,7 @@ from sqlalchemy import select
from sqlalchemy.orm import Session
from onyx.db.api_key import is_api_key_email_address
from onyx.db.engine import get_session_with_current_tenant
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.models import ChatMessage
from onyx.db.models import ChatSession
from onyx.db.models import TokenRateLimit

View File

@@ -37,11 +37,11 @@ from onyx.configs.constants import QueryHistoryType
from onyx.configs.constants import SessionType
from onyx.db.chat import get_chat_session_by_id
from onyx.db.chat import get_chat_sessions_by_user
from onyx.db.engine import get_session
from onyx.db.engine.sql_engine import get_session
from onyx.db.enums import TaskStatus
from onyx.db.file_record import get_query_history_export_files
from onyx.db.models import ChatSession
from onyx.db.models import User
from onyx.db.pg_file_store import get_query_history_export_files
from onyx.db.tasks import get_task_with_id
from onyx.db.tasks import register_task
from onyx.file_store.file_store import get_default_file_store
@@ -49,6 +49,7 @@ from onyx.server.documents.models import PaginatedReturn
from onyx.server.query_and_chat.models import ChatSessionDetails
from onyx.server.query_and_chat.models import ChatSessionsResponse
from onyx.utils.threadpool_concurrency import parallel_yield
from shared_configs.contextvars import get_current_tenant_id
router = APIRouter()
@@ -334,6 +335,7 @@ def start_query_history_export(
"start": start,
"end": end,
"start_time": start_time,
"tenant_id": get_current_tenant_id(),
},
)
@@ -360,7 +362,7 @@ def get_query_history_export_status(
report_name = construct_query_history_report_name(request_id)
has_file = file_store.has_file(
file_name=report_name,
file_id=report_name,
file_origin=FileOrigin.QUERY_HISTORY_CSV,
file_type=FileType.CSV,
)
@@ -385,7 +387,7 @@ def download_query_history_csv(
report_name = construct_query_history_report_name(request_id)
file_store = get_default_file_store(db_session)
has_file = file_store.has_file(
file_name=report_name,
file_id=report_name,
file_origin=FileOrigin.QUERY_HISTORY_CSV,
file_type=FileType.CSV,
)
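
Two changes run through the query-history endpoints: the current tenant id is now passed explicitly in the export task kwargs, and has_file keys its lookup by file_id rather than file_name. A small sketch of the kwargs pattern; build_export_task_kwargs and the argument types are illustrative:

from typing import Any

from shared_configs.contextvars import get_current_tenant_id


def build_export_task_kwargs(start: Any, end: Any, start_time: Any) -> dict[str, Any]:
    # Sketch only: the tenant id is resolved at enqueue time and handed to the
    # worker alongside the existing kwargs, matching the hunk above.
    return {
        "start": start,
        "end": end,
        "start_time": start_time,
        "tenant_id": get_current_tenant_id(),
    }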

View File

@@ -12,7 +12,7 @@ from onyx.configs.constants import SessionType
from onyx.db.enums import TaskStatus
from onyx.db.models import ChatMessage
from onyx.db.models import ChatSession
from onyx.db.models import PGFileStore
from onyx.db.models import FileRecord
from onyx.db.models import TaskQueueState
@@ -254,7 +254,7 @@ class QueryHistoryExport(BaseModel):
@classmethod
def from_file(
cls,
file: PGFileStore,
file: FileRecord,
) -> "QueryHistoryExport":
if not file.file_metadata or not isinstance(file.file_metadata, dict):
raise RuntimeError(
@@ -262,7 +262,7 @@ class QueryHistoryExport(BaseModel):
)
metadata = QueryHistoryFileMetadata.model_validate(dict(file.file_metadata))
task_id = extract_task_id_from_query_history_report_name(file.file_name)
task_id = extract_task_id_from_query_history_report_name(file.file_id)
return cls(
task_id=task_id,

View File

@@ -14,7 +14,7 @@ from ee.onyx.db.usage_export import get_usage_report_data
from ee.onyx.db.usage_export import UsageReportMetadata
from ee.onyx.server.reporting.usage_export_generation import create_new_usage_report
from onyx.auth.users import current_admin_user
from onyx.db.engine import get_session
from onyx.db.engine.sql_engine import get_session
from onyx.db.models import User
from onyx.file_store.constants import STANDARD_CHUNK_SIZE

View File

@@ -62,17 +62,16 @@ def generate_chat_messages_report(
]
)
# after writing seek to begining of buffer
# after writing seek to beginning of buffer
temp_file.seek(0)
file_store.save_file(
file_name=file_name,
file_id = file_store.save_file(
content=temp_file,
display_name=file_name,
file_origin=FileOrigin.OTHER,
file_type="text/csv",
)
return file_name
return file_id
def generate_user_report(
@@ -97,15 +96,14 @@ def generate_user_report(
csvwriter.writerow([user_skeleton.user_id, user_skeleton.is_active])
temp_file.seek(0)
file_store.save_file(
file_name=file_name,
file_id = file_store.save_file(
content=temp_file,
display_name=file_name,
file_origin=FileOrigin.OTHER,
file_type="text/csv",
)
return file_name
return file_id
def create_new_usage_report(
@@ -116,16 +114,16 @@ def create_new_usage_report(
report_id = str(uuid.uuid4())
file_store = get_default_file_store(db_session)
messages_filename = generate_chat_messages_report(
messages_file_id = generate_chat_messages_report(
db_session, file_store, report_id, period
)
users_filename = generate_user_report(db_session, file_store, report_id)
users_file_id = generate_user_report(db_session, file_store, report_id)
with tempfile.SpooledTemporaryFile(max_size=MAX_IN_MEMORY_SIZE) as zip_buffer:
with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED) as zip_file:
# write messages
chat_messages_tmpfile = file_store.read_file(
messages_filename, mode="b", use_tempfile=True
messages_file_id, mode="b", use_tempfile=True
)
zip_file.writestr(
"chat_messages.csv",
@@ -134,7 +132,7 @@ def create_new_usage_report(
# write users
users_tmpfile = file_store.read_file(
users_filename, mode="b", use_tempfile=True
users_file_id, mode="b", use_tempfile=True
)
zip_file.writestr("users.csv", users_tmpfile.read())
@@ -146,11 +144,11 @@ def create_new_usage_report(
f"_{report_id}_usage_report.zip"
)
file_store.save_file(
file_name=report_name,
content=zip_buffer,
display_name=report_name,
file_origin=FileOrigin.GENERATED_REPORT,
file_type="application/zip",
file_id=report_name,
)
# add report after zip file is written
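
Taken together, these hunks move the file store from file_name-keyed calls to an id-based API: save_file takes display_name (and optionally an explicit file_id), returns the id, and read_file/has_file consume that id. A rough usage sketch under those assumptions; the BytesIO payload, report name, and FileOrigin import path are illustrative:

import io

from sqlalchemy.orm import Session

from onyx.configs.constants import FileOrigin  # import path assumed
from onyx.file_store.file_store import get_default_file_store


def save_and_reload_report(db_session: Session) -> bytes:
    # Sketch only: keyword names follow the hunks above; the CSV payload is fake.
    file_store = get_default_file_store(db_session)
    file_id = file_store.save_file(
        content=io.BytesIO(b"col_a,col_b\n1,2\n"),
        display_name="example_report.csv",
        file_origin=FileOrigin.OTHER,
        file_type="text/csv",
    )
    return file_store.read_file(file_id, mode="b").read()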

View File

@@ -27,9 +27,9 @@ from onyx.auth.users import get_user_manager
from onyx.configs.app_configs import SESSION_EXPIRE_TIME_SECONDS
from onyx.db.auth import get_user_count
from onyx.db.auth import get_user_db
from onyx.db.engine import get_async_session
from onyx.db.engine import get_async_session_context_manager
from onyx.db.engine import get_session
from onyx.db.engine.async_sql_engine import get_async_session
from onyx.db.engine.async_sql_engine import get_async_session_context_manager
from onyx.db.engine.sql_engine import get_session
from onyx.db.models import User
from onyx.utils.logger import setup_logger

View File

@@ -19,7 +19,7 @@ from ee.onyx.server.enterprise_settings.store import (
)
from ee.onyx.server.enterprise_settings.store import upload_logo
from onyx.context.search.enums import RecencyBiasSetting
from onyx.db.engine import get_session_context_manager
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.llm import update_default_provider
from onyx.db.llm import upsert_llm_provider
from onyx.db.models import Tool
@@ -235,7 +235,7 @@ def seed_db() -> None:
logger.debug("No seeding configuration file passed")
return
with get_session_context_manager() as db_session:
with get_session_with_current_tenant() as db_session:
if seed_config.llms is not None:
_seed_llms(db_session, seed_config.llms)
if seed_config.personas is not None:

View File

@@ -10,7 +10,7 @@ from ee.onyx.server.tenants.user_mapping import get_tenant_id_for_email
from onyx.auth.users import auth_backend
from onyx.auth.users import get_redis_strategy
from onyx.auth.users import User
from onyx.db.engine import get_session_with_tenant
from onyx.db.engine.sql_engine import get_session_with_tenant
from onyx.db.users import get_user_by_email
from onyx.utils.logger import setup_logger

View File

@@ -18,7 +18,7 @@ from onyx.auth.users import optional_user
from onyx.auth.users import User
from onyx.configs.constants import ANONYMOUS_USER_COOKIE_NAME
from onyx.configs.constants import FASTAPI_USERS_AUTH_COOKIE_NAME
from onyx.db.engine import get_session_with_shared_schema
from onyx.db.engine.sql_engine import get_session_with_shared_schema
from onyx.utils.logger import setup_logger
from shared_configs.contextvars import get_current_tenant_id

View File

@@ -28,8 +28,8 @@ from onyx.auth.users import exceptions
from onyx.configs.app_configs import CONTROL_PLANE_API_BASE_URL
from onyx.configs.app_configs import DEV_MODE
from onyx.configs.constants import MilestoneRecordType
from onyx.db.engine import get_session_with_shared_schema
from onyx.db.engine import get_session_with_tenant
from onyx.db.engine.sql_engine import get_session_with_shared_schema
from onyx.db.engine.sql_engine import get_session_with_tenant
from onyx.db.llm import update_default_provider
from onyx.db.llm import upsert_cloud_embedding_provider
from onyx.db.llm import upsert_llm_provider

View File

@@ -8,8 +8,8 @@ from sqlalchemy.schema import CreateSchema
from alembic import command
from alembic.config import Config
from onyx.db.engine import build_connection_string
from onyx.db.engine import get_sqlalchemy_engine
from onyx.db.engine.sql_engine import build_connection_string
from onyx.db.engine.sql_engine import get_sqlalchemy_engine
logger = logging.getLogger(__name__)
@@ -34,7 +34,7 @@ def run_alembic_migrations(schema_name: str) -> None:
# Mimic command-line options by adding 'cmd_opts' to the config
alembic_cfg.cmd_opts = SimpleNamespace() # type: ignore
alembic_cfg.cmd_opts.x = [f"schema={schema_name}"] # type: ignore
alembic_cfg.cmd_opts.x = [f"schemas={schema_name}"] # type: ignore
# Run migrations programmatically
command.upgrade(alembic_cfg, "head")
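
Note the renamed -x key: env.py now expects schemas= (plural) instead of schema=. A minimal sketch of the programmatic invocation mirroring the hunk, with the alembic.ini path assumed; the command-line equivalent would presumably be alembic -x schemas=<schema_name> upgrade head.

from types import SimpleNamespace

from alembic import command
from alembic.config import Config


def upgrade_schema(schema_name: str) -> None:
    # Mirrors the hunk above; the ini path is an assumption.
    alembic_cfg = Config("alembic.ini")
    alembic_cfg.cmd_opts = SimpleNamespace()  # type: ignore
    alembic_cfg.cmd_opts.x = [f"schemas={schema_name}"]  # note the plural key
    command.upgrade(alembic_cfg, "head")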

View File

@@ -9,7 +9,7 @@ from ee.onyx.server.tenants.user_mapping import remove_users_from_tenant
from onyx.auth.users import current_admin_user
from onyx.auth.users import User
from onyx.db.auth import get_user_count
from onyx.db.engine import get_session
from onyx.db.engine.sql_engine import get_session
from onyx.db.users import delete_user_from_db
from onyx.db.users import get_user_by_email
from onyx.server.manage.models import UserByEmail

View File

@@ -5,8 +5,8 @@ from onyx.auth.invited_users import get_invited_users
from onyx.auth.invited_users import get_pending_users
from onyx.auth.invited_users import write_invited_users
from onyx.auth.invited_users import write_pending_users
from onyx.db.engine import get_session_with_shared_schema
from onyx.db.engine import get_session_with_tenant
from onyx.db.engine.sql_engine import get_session_with_shared_schema
from onyx.db.engine.sql_engine import get_session_with_tenant
from onyx.db.models import UserTenantMapping
from onyx.server.manage.models import TenantSnapshot
from onyx.utils.logger import setup_logger

View File

@@ -9,7 +9,7 @@ from ee.onyx.db.token_limit import fetch_user_group_token_rate_limits_for_user
from ee.onyx.db.token_limit import insert_user_group_token_rate_limit
from onyx.auth.users import current_admin_user
from onyx.auth.users import current_curator_or_admin_user
from onyx.db.engine import get_session
from onyx.db.engine.sql_engine import get_session
from onyx.db.models import User
from onyx.db.token_limit import fetch_all_user_token_rate_limits
from onyx.db.token_limit import insert_user_token_rate_limit

View File

@@ -16,7 +16,7 @@ from ee.onyx.server.user_group.models import UserGroupCreate
from ee.onyx.server.user_group.models import UserGroupUpdate
from onyx.auth.users import current_admin_user
from onyx.auth.users import current_curator_or_admin_user
from onyx.db.engine import get_session
from onyx.db.engine.sql_engine import get_session
from onyx.db.models import User
from onyx.db.models import UserRole
from onyx.utils.logger import setup_logger

View File

@@ -40,6 +40,30 @@ class ExternalAccess:
def num_entries(self) -> int:
return len(self.external_user_emails) + len(self.external_user_group_ids)
@classmethod
def public(cls) -> "ExternalAccess":
return cls(
external_user_emails=set(),
external_user_group_ids=set(),
is_public=True,
)
@classmethod
def empty(cls) -> "ExternalAccess":
"""
A helper that returns empty sets of external user emails and group ids and sets `is_public` to `False`.
This effectively makes the document in question private and inaccessible to anyone else.
It is especially useful during permission syncing when a document's permissions cannot be determined
(for whatever reason); falling back to a "private" `ExternalAccess` is a safe default.
"""
return cls(
external_user_emails=set(),
external_user_group_ids=set(),
is_public=False,
)
@dataclass(frozen=True)
class DocExternalAccess:
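
The new classmethods give callers two explicit shortcuts: ExternalAccess.public() for world-readable documents and ExternalAccess.empty() as a private fallback when permissions cannot be resolved. A hedged sketch of the fallback pattern during permission syncing; fetch_permissions, resolve_access, and the module path are illustrative:

from onyx.access.models import ExternalAccess  # module path assumed


def fetch_permissions(doc_id: str) -> tuple[set[str], set[str]]:
    # Illustrative stand-in for a connector call that may fail or return partial data.
    return {"alice@example.com"}, {"eng-team"}


def resolve_access(doc_id: str) -> ExternalAccess:
    try:
        emails, groups = fetch_permissions(doc_id)
    except Exception:
        # Permissions could not be determined: fall back to private, as the
        # docstring on ExternalAccess.empty() suggests.
        return ExternalAccess.empty()
    return ExternalAccess(
        external_user_emails=emails,
        external_user_group_ids=groups,
        is_public=False,
    )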

View File

@@ -78,7 +78,7 @@ def should_continue(state: BasicState) -> str:
if __name__ == "__main__":
from onyx.db.engine import get_session_context_manager
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.context.search.models import SearchRequest
from onyx.llm.factory import get_default_llms
from onyx.agents.agent_search.shared_graph_utils.utils import get_test_config
@@ -87,7 +87,7 @@ if __name__ == "__main__":
compiled_graph = graph.compile()
input = BasicInput(unused=True)
primary_llm, fast_llm = get_default_llms()
with get_session_context_manager() as db_session:
with get_session_with_current_tenant() as db_session:
config, _ = get_test_config(
db_session=db_session,
primary_llm=primary_llm,

View File

@@ -4,7 +4,7 @@ from typing import cast
from onyx.chat.models import LlmDoc
from onyx.configs.constants import DocumentSource
from onyx.context.search.models import InferenceSection
from onyx.db.engine import get_session_with_current_tenant
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.tools.models import SearchToolOverrideKwargs
from onyx.tools.tool_implementations.search.search_tool import (
FINAL_CONTEXT_DOCUMENTS_ID,

View File

@@ -111,7 +111,7 @@ def answer_query_graph_builder() -> StateGraph:
if __name__ == "__main__":
from onyx.db.engine import get_session_context_manager
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.llm.factory import get_default_llms
from onyx.context.search.models import SearchRequest
@@ -121,7 +121,7 @@ if __name__ == "__main__":
search_request = SearchRequest(
query="what can you do with onyx or danswer?",
)
with get_session_context_manager() as db_session:
with get_session_with_current_tenant() as db_session:
graph_config, search_tool = get_test_config(
db_session, primary_llm, fast_llm, search_request
)

View File

@@ -238,7 +238,7 @@ def agent_search_graph_builder() -> StateGraph:
if __name__ == "__main__":
pass
from onyx.db.engine import get_session_context_manager
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.llm.factory import get_default_llms
from onyx.context.search.models import SearchRequest
@@ -246,7 +246,7 @@ if __name__ == "__main__":
compiled_graph = graph.compile()
primary_llm, fast_llm = get_default_llms()
with get_session_context_manager() as db_session:
with get_session_with_current_tenant() as db_session:
search_request = SearchRequest(query="Who created Excel?")
graph_config = get_test_config(
db_session, primary_llm, fast_llm, search_request

View File

@@ -109,7 +109,7 @@ def answer_refined_query_graph_builder() -> StateGraph:
if __name__ == "__main__":
from onyx.db.engine import get_session_context_manager
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.llm.factory import get_default_llms
from onyx.context.search.models import SearchRequest
@@ -119,7 +119,7 @@ if __name__ == "__main__":
search_request = SearchRequest(
query="what can you do with onyx or danswer?",
)
with get_session_context_manager() as db_session:
with get_session_with_current_tenant() as db_session:
inputs = SubQuestionAnsweringInput(
question="what can you do with onyx?",
question_id="0_0",

View File

@@ -131,7 +131,7 @@ def expanded_retrieval_graph_builder() -> StateGraph:
if __name__ == "__main__":
from onyx.db.engine import get_session_context_manager
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.llm.factory import get_default_llms
from onyx.context.search.models import SearchRequest
@@ -142,7 +142,7 @@ if __name__ == "__main__":
query="what can you do with onyx or danswer?",
)
with get_session_context_manager() as db_session:
with get_session_with_current_tenant() as db_session:
graph_config, search_tool = get_test_config(
db_session, primary_llm, fast_llm, search_request
)

View File

@@ -24,7 +24,7 @@ from onyx.context.search.models import InferenceSection
from onyx.context.search.models import RerankingDetails
from onyx.context.search.postprocessing.postprocessing import rerank_sections
from onyx.context.search.postprocessing.postprocessing import should_rerank
from onyx.db.engine import get_session_context_manager
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.search_settings import get_current_search_settings
from onyx.utils.timing import log_function_time
@@ -60,7 +60,7 @@ def rerank_documents(
allow_agent_reranking = graph_config.behavior.allow_agent_reranking
if rerank_settings is None:
with get_session_context_manager() as db_session:
with get_session_with_current_tenant() as db_session:
search_settings = get_current_search_settings(db_session)
if not search_settings.disable_rerank_for_streaming:
rerank_settings = RerankingDetails.from_db_model(search_settings)
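
Across the agent-search nodes, get_session_context_manager() is replaced by get_session_with_current_tenant(), which opens a fresh session bound to the tenant in the current context. A minimal sketch of the new pattern, reusing the settings lookup shown above; load_rerank_flag is an illustrative wrapper:

from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.search_settings import get_current_search_settings


def load_rerank_flag() -> bool:
    # A fresh session per call avoids sharing one session across concurrent
    # graph nodes, mirroring the hunks above.
    with get_session_with_current_tenant() as db_session:
        search_settings = get_current_search_settings(db_session)
        return not search_settings.disable_rerank_for_streaming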

View File

@@ -21,7 +21,7 @@ from onyx.agents.agent_search.shared_graph_utils.utils import (
from onyx.configs.agent_configs import AGENT_MAX_QUERY_RETRIEVAL_RESULTS
from onyx.configs.agent_configs import AGENT_RETRIEVAL_STATS
from onyx.context.search.models import InferenceSection
from onyx.db.engine import get_session_context_manager
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.tools.models import SearchQueryInfo
from onyx.tools.models import SearchToolOverrideKwargs
from onyx.tools.tool_implementations.search.search_tool import (
@@ -67,7 +67,7 @@ def retrieve_documents(
callback_container: list[list[InferenceSection]] = []
# new db session to avoid concurrency issues
with get_session_context_manager() as db_session:
with get_session_with_current_tenant() as db_session:
for tool_response in search_tool.run(
query=query_to_retrieve,
override_kwargs=SearchToolOverrideKwargs(

View File

@@ -19,7 +19,7 @@ from onyx.chat.models import SubQuestionPiece
from onyx.context.search.models import InferenceChunk
from onyx.context.search.models import InferenceSection
from onyx.db.document import get_kg_doc_info_for_entity_name
from onyx.db.engine import get_session_with_current_tenant
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.entities import get_document_id_for_entity
from onyx.db.entities import get_entity_name
from onyx.db.entity_type import get_entity_types

View File

@@ -25,16 +25,17 @@ from onyx.agents.agent_search.shared_graph_utils.utils import (
)
from onyx.configs.kg_configs import KG_ENTITY_EXTRACTION_TIMEOUT
from onyx.configs.kg_configs import KG_RELATIONSHIP_EXTRACTION_TIMEOUT
from onyx.db.engine import get_session_with_current_tenant
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.kg_temp_view import create_views
from onyx.db.kg_temp_view import get_user_view_names
from onyx.db.relationships import get_allowed_relationship_type_pairs
from onyx.kg.extractions.extraction_processing import get_entity_types_str
from onyx.kg.extractions.extraction_processing import get_relationship_types_str
from onyx.kg.utils.extraction_utils import get_entity_types_str
from onyx.kg.utils.extraction_utils import get_relationship_types_str
from onyx.prompts.kg_prompts import QUERY_ENTITY_EXTRACTION_PROMPT
from onyx.prompts.kg_prompts import QUERY_RELATIONSHIP_EXTRACTION_PROMPT
from onyx.utils.logger import setup_logger
from onyx.utils.threadpool_concurrency import run_with_timeout
from shared_configs.contextvars import get_current_tenant_id
logger = setup_logger()
@@ -80,10 +81,12 @@ def extract_ert(
stream_write_step_activities(writer, _KG_STEP_NR)
# Create temporary views. TODO: move into parallel step, if ultimately materialized
kg_views = get_user_view_names(user_email)
tenant_id = get_current_tenant_id()
kg_views = get_user_view_names(user_email, tenant_id)
with get_session_with_current_tenant() as db_session:
create_views(
db_session,
tenant_id=tenant_id,
user_email=user_email,
allowed_docs_view_name=kg_views.allowed_docs_view_name,
kg_relationships_view_name=kg_views.kg_relationships_view_name,
@@ -133,15 +136,14 @@ def extract_ert(
last_bracket = cleaned_response.rfind("}")
cleaned_response = cleaned_response[first_bracket : last_bracket + 1]
try:
entity_extraction_result = (
KGQuestionEntityExtractionResult.model_validate_json(cleaned_response)
)
except ValidationError:
logger.error("Failed to parse LLM response as JSON in Entity Extraction")
entity_extraction_result = KGQuestionEntityExtractionResult(
entities=[], time_filter=""
)
entity_extraction_result = KGQuestionEntityExtractionResult.model_validate_json(
cleaned_response
)
except ValidationError:
logger.error("Failed to parse LLM response as JSON in Entity Extraction")
entity_extraction_result = KGQuestionEntityExtractionResult(
entities=[], time_filter=""
)
except Exception as e:
logger.error(f"Error in extract_ert: {e}")
entity_extraction_result = KGQuestionEntityExtractionResult(

View File

@@ -27,7 +27,7 @@ from onyx.agents.agent_search.shared_graph_utils.utils import (
get_langgraph_node_log_string,
)
from onyx.configs.kg_configs import KG_STRATEGY_GENERATION_TIMEOUT
from onyx.db.engine import get_session_with_current_tenant
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.entities import get_document_id_for_entity
from onyx.kg.clustering.normalizations import normalize_entities
from onyx.kg.clustering.normalizations import normalize_relationships
@@ -265,10 +265,7 @@ def analyze(
Format: {output_format.value}, Broken down question: {broken_down_question}"
extraction_detected_relationships = len(query_graph_relationships) > 0
if (
extraction_detected_relationships
or relationship_detection == KGRelationshipDetection.RELATIONSHIPS.value
):
if extraction_detected_relationships:
query_type = KGRelationshipDetection.RELATIONSHIPS.value
if extraction_detected_relationships:

View File

@@ -29,7 +29,7 @@ from onyx.configs.kg_configs import KG_SQL_GENERATION_TIMEOUT_OVERRIDE
from onyx.configs.kg_configs import KG_TEMP_ALLOWED_DOCS_VIEW_NAME_PREFIX
from onyx.configs.kg_configs import KG_TEMP_KG_ENTITIES_VIEW_NAME_PREFIX
from onyx.configs.kg_configs import KG_TEMP_KG_RELATIONSHIPS_VIEW_NAME_PREFIX
from onyx.db.engine import get_db_readonly_user_session_with_current_tenant
from onyx.db.engine.sql_engine import get_db_readonly_user_session_with_current_tenant
from onyx.db.kg_temp_view import drop_views
from onyx.llm.interfaces import LLM
from onyx.prompts.kg_prompts import ENTITY_SOURCE_DETECTION_PROMPT
@@ -200,6 +200,9 @@ def generate_simple_sql(
if state.kg_rel_temp_view_name is None:
raise ValueError("kg_rel_temp_view_name is not set")
if state.kg_entity_temp_view_name is None:
raise ValueError("kg_entity_temp_view_name is not set")
## STEP 3 - articulate goals
stream_write_step_activities(writer, _KG_STEP_NR)
@@ -311,9 +314,8 @@ def generate_simple_sql(
)
sql_statement = sql_statement.split(";")[0].strip() + ";"
sql_statement = sql_statement.replace("sql", "").strip()
sql_statement = sql_statement.replace("kg_relationship", rel_temp_view)
if ent_temp_view:
sql_statement = sql_statement.replace("kg_entity", ent_temp_view)
sql_statement = sql_statement.replace("relationship_table", rel_temp_view)
sql_statement = sql_statement.replace("entity_table", ent_temp_view)
reasoning = (
cleaned_response.split("<reasoning>")[1]
@@ -399,7 +401,12 @@ def generate_simple_sql(
if source_documents_sql and ent_temp_view:
source_documents_sql = source_documents_sql.replace(
"kg_entity", ent_temp_view
"entity_table", ent_temp_view
)
if source_documents_sql and rel_temp_view:
source_documents_sql = source_documents_sql.replace(
"relationship_table", rel_temp_view
)
logger.debug(f"A3 source_documents_sql: {source_documents_sql}")

View File

@@ -13,7 +13,7 @@ from onyx.agents.agent_search.shared_graph_utils.utils import (
get_langgraph_node_log_string,
)
from onyx.configs.kg_configs import KG_FILTER_CONSTRUCTION_TIMEOUT
from onyx.db.engine import get_session_with_current_tenant
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.entity_type import get_entity_types_with_grounded_source_name
from onyx.kg.utils.formatting_utils import make_entity_id
from onyx.prompts.kg_prompts import SEARCH_FILTER_CONSTRUCTION_PROMPT

View File

@@ -16,7 +16,7 @@ from onyx.agents.agent_search.shared_graph_utils.utils import (
from onyx.agents.agent_search.shared_graph_utils.utils import write_custom_event
from onyx.chat.models import SubQueryPiece
from onyx.db.document import get_base_llm_doc_information
from onyx.db.engine import get_session_with_current_tenant
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.utils.logger import setup_logger

View File

@@ -28,7 +28,7 @@ from onyx.configs.kg_configs import KG_TIMEOUT_CONNECT_LLM_INITIAL_ANSWER_GENERA
from onyx.configs.kg_configs import KG_TIMEOUT_LLM_INITIAL_ANSWER_GENERATION
from onyx.context.search.enums import SearchType
from onyx.context.search.models import InferenceSection
from onyx.db.engine import get_session_with_current_tenant
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.prompts.kg_prompts import OUTPUT_FORMAT_NO_EXAMPLES_PROMPT
from onyx.prompts.kg_prompts import OUTPUT_FORMAT_NO_OVERALL_ANSWER_PROMPT
from onyx.tools.tool_implementations.search.search_tool import IndexFilters

View File

@@ -5,7 +5,7 @@ from onyx.chat.models import LlmDoc
from onyx.configs.constants import DocumentSource
from onyx.configs.kg_configs import KG_RESEARCH_NUM_RETRIEVED_DOCS
from onyx.context.search.models import InferenceSection
from onyx.db.engine import get_session_with_current_tenant
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.tools.models import SearchToolOverrideKwargs
from onyx.tools.tool_implementations.search.search_tool import (
FINAL_CONTEXT_DOCUMENTS_ID,

View File

@@ -51,6 +51,7 @@ def _create_history_str(prompt_builder: AnswerPromptBuilder) -> str:
else:
continue
history_segments.append(f"{role}:\n {msg.content}\n\n")
return "\n".join(history_segments)

View File

@@ -33,7 +33,7 @@ from onyx.chat.models import SubQueryPiece
from onyx.chat.models import SubQuestionPiece
from onyx.chat.models import ToolResponse
from onyx.context.search.models import SearchRequest
from onyx.db.engine import get_session_context_manager
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.llm.factory import get_default_llms
from onyx.tools.tool_runner import ToolCallKickoff
from onyx.utils.logger import setup_logger
@@ -195,7 +195,7 @@ if __name__ == "__main__":
query="Do a search to tell me what is the difference between astronomy and astrology?",
)
with get_session_context_manager() as db_session:
with get_session_with_current_tenant() as db_session:
config = get_test_config(db_session, primary_llm, fast_llm, search_request)
assert (
config.persistence is not None

View File

@@ -56,7 +56,7 @@ from onyx.context.search.enums import LLMEvaluationType
from onyx.context.search.models import InferenceSection
from onyx.context.search.models import RetrievalDetails
from onyx.context.search.models import SearchRequest
from onyx.db.engine import get_session_context_manager
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.persona import get_persona_by_id
from onyx.db.persona import Persona
from onyx.llm.chat_llm import LLMRateLimitError
@@ -363,7 +363,7 @@ def retrieve_search_docs(
retrieved_docs: list[InferenceSection] = []
# new db session to avoid concurrency issues
with get_session_context_manager() as db_session:
with get_session_with_current_tenant() as db_session:
for tool_response in search_tool.run(
query=question,
override_kwargs=SearchToolOverrideKwargs(

View File

@@ -97,9 +97,9 @@ from onyx.db.auth import get_default_admin_user_emails
from onyx.db.auth import get_user_count
from onyx.db.auth import get_user_db
from onyx.db.auth import SQLAlchemyUserAdminDB
from onyx.db.engine import get_async_session
from onyx.db.engine import get_async_session_context_manager
from onyx.db.engine import get_session_with_tenant
from onyx.db.engine.async_sql_engine import get_async_session
from onyx.db.engine.async_sql_engine import get_async_session_context_manager
from onyx.db.engine.sql_engine import get_session_with_tenant
from onyx.db.models import AccessToken
from onyx.db.models import OAuthAccount
from onyx.db.models import User

View File

@@ -26,7 +26,7 @@ from onyx.background.celery.celery_utils import celery_is_worker_primary
from onyx.background.celery.celery_utils import make_probe_path
from onyx.configs.constants import ONYX_CLOUD_CELERY_TASK_PREFIX
from onyx.configs.constants import OnyxRedisLocks
from onyx.db.engine import get_sqlalchemy_engine
from onyx.db.engine.sql_engine import get_sqlalchemy_engine
from onyx.document_index.vespa.shared_utils.utils import wait_for_vespa_with_timeout
from onyx.httpx.httpx_pool import HttpxPool
from onyx.redis.redis_connector import RedisConnector

View File

@@ -11,8 +11,8 @@ import onyx.background.celery.apps.app_base as app_base
from onyx.background.celery.celery_utils import make_probe_path
from onyx.background.celery.tasks.beat_schedule import CLOUD_BEAT_MULTIPLIER_DEFAULT
from onyx.configs.constants import POSTGRES_CELERY_BEAT_APP_NAME
from onyx.db.engine import get_all_tenant_ids
from onyx.db.engine import SqlEngine
from onyx.db.engine.sql_engine import SqlEngine
from onyx.db.engine.tenant_utils import get_all_tenant_ids
from onyx.server.runtime.onyx_runtime import OnyxRuntime
from onyx.utils.variable_functionality import fetch_versioned_implementation
from shared_configs.configs import IGNORED_SYNCING_TENANT_LIST

View File

@@ -12,7 +12,7 @@ from celery.signals import worker_shutdown
import onyx.background.celery.apps.app_base as app_base
from onyx.configs.constants import POSTGRES_CELERY_WORKER_HEAVY_APP_NAME
from onyx.db.engine import SqlEngine
from onyx.db.engine.sql_engine import SqlEngine
from onyx.utils.logger import setup_logger
from shared_configs.configs import MULTI_TENANT

View File

@@ -13,7 +13,7 @@ from celery.signals import worker_shutdown
import onyx.background.celery.apps.app_base as app_base
from onyx.configs.constants import POSTGRES_CELERY_WORKER_INDEXING_APP_NAME
from onyx.db.engine import SqlEngine
from onyx.db.engine.sql_engine import SqlEngine
from onyx.utils.logger import setup_logger
from shared_configs.configs import MULTI_TENANT

View File

@@ -13,7 +13,7 @@ from celery.signals import worker_shutdown
import onyx.background.celery.apps.app_base as app_base
from onyx.configs.constants import POSTGRES_CELERY_WORKER_KG_PROCESSING_APP_NAME
from onyx.db.engine import SqlEngine
from onyx.db.engine.sql_engine import SqlEngine
from onyx.utils.logger import setup_logger
from shared_configs.configs import MULTI_TENANT
@@ -98,6 +98,10 @@ def on_setup_logging(
app_base.on_setup_logging(loglevel, logfile, format, colorize, **kwargs)
base_bootsteps = app_base.get_bootsteps()
for bootstep in base_bootsteps:
celery_app.steps["worker"].add(bootstep)
celery_app.autodiscover_tasks(
[
"onyx.background.celery.tasks.kg_processing",

View File

@@ -15,7 +15,7 @@ from onyx.configs.app_configs import MANAGED_VESPA
from onyx.configs.app_configs import VESPA_CLOUD_CERT_PATH
from onyx.configs.app_configs import VESPA_CLOUD_KEY_PATH
from onyx.configs.constants import POSTGRES_CELERY_WORKER_LIGHT_APP_NAME
from onyx.db.engine import SqlEngine
from onyx.db.engine.sql_engine import SqlEngine
from onyx.utils.logger import setup_logger
from shared_configs.configs import MULTI_TENANT

View File

@@ -11,7 +11,7 @@ from celery.signals import worker_shutdown
import onyx.background.celery.apps.app_base as app_base
from onyx.configs.constants import POSTGRES_CELERY_WORKER_MONITORING_APP_NAME
from onyx.db.engine import SqlEngine
from onyx.db.engine.sql_engine import SqlEngine
from onyx.utils.logger import setup_logger
from shared_configs.configs import MULTI_TENANT

View File

@@ -25,8 +25,8 @@ from onyx.configs.constants import CELERY_PRIMARY_WORKER_LOCK_TIMEOUT
from onyx.configs.constants import OnyxRedisConstants
from onyx.configs.constants import OnyxRedisLocks
from onyx.configs.constants import POSTGRES_CELERY_WORKER_PRIMARY_APP_NAME
from onyx.db.engine import get_session_with_current_tenant
from onyx.db.engine import SqlEngine
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.engine.sql_engine import SqlEngine
from onyx.db.index_attempt import get_index_attempt
from onyx.db.index_attempt import mark_attempt_canceled
from onyx.redis.redis_connector_credential_pair import (

View File

@@ -2,6 +2,7 @@ import copy
from datetime import timedelta
from typing import Any
from onyx.configs.app_configs import ENTERPRISE_EDITION_ENABLED
from onyx.configs.app_configs import LLM_MODEL_UPDATE_API_URL
from onyx.configs.constants import ONYX_CLOUD_CELERY_TASK_PREFIX
from onyx.configs.constants import OnyxCeleryPriority
@@ -24,114 +25,134 @@ CLOUD_BEAT_MULTIPLIER_DEFAULT = 8.0
CLOUD_DOC_PERMISSION_SYNC_MULTIPLIER_DEFAULT = 1.0
# tasks that run in either self-hosted or cloud
beat_task_templates: list[dict] = []
beat_task_templates: list[dict] = [
{
"name": "check-for-kg-processing",
"task": OnyxCeleryTask.CHECK_KG_PROCESSING,
"schedule": timedelta(seconds=60),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-for-kg-processing-clustering-only",
"task": OnyxCeleryTask.CHECK_KG_PROCESSING_CLUSTERING_ONLY,
"schedule": timedelta(seconds=600),
"options": {
"priority": OnyxCeleryPriority.LOW,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-for-indexing",
"task": OnyxCeleryTask.CHECK_FOR_INDEXING,
"schedule": timedelta(seconds=15),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-for-checkpoint-cleanup",
"task": OnyxCeleryTask.CHECK_FOR_CHECKPOINT_CLEANUP,
"schedule": timedelta(hours=1),
"options": {
"priority": OnyxCeleryPriority.LOW,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-for-connector-deletion",
"task": OnyxCeleryTask.CHECK_FOR_CONNECTOR_DELETION,
"schedule": timedelta(seconds=20),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-for-vespa-sync",
"task": OnyxCeleryTask.CHECK_FOR_VESPA_SYNC_TASK,
"schedule": timedelta(seconds=20),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-for-user-file-folder-sync",
"task": OnyxCeleryTask.CHECK_FOR_USER_FILE_FOLDER_SYNC,
"schedule": timedelta(
days=1
), # This should essentially always be triggered manually for user folder updates.
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-for-pruning",
"task": OnyxCeleryTask.CHECK_FOR_PRUNING,
"schedule": timedelta(seconds=20),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-for-doc-permissions-sync",
"task": OnyxCeleryTask.CHECK_FOR_DOC_PERMISSIONS_SYNC,
"schedule": timedelta(seconds=30),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-for-external-group-sync",
"task": OnyxCeleryTask.CHECK_FOR_EXTERNAL_GROUP_SYNC,
"schedule": timedelta(seconds=20),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "monitor-background-processes",
"task": OnyxCeleryTask.MONITOR_BACKGROUND_PROCESSES,
"schedule": timedelta(minutes=5),
"options": {
"priority": OnyxCeleryPriority.LOW,
"expires": BEAT_EXPIRES_DEFAULT,
"queue": OnyxCeleryQueues.MONITORING,
},
},
]
beat_task_templates.extend(
[
{
"name": "check-for-kg-processing",
"task": OnyxCeleryTask.CHECK_KG_PROCESSING,
"schedule": timedelta(seconds=60),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
if ENTERPRISE_EDITION_ENABLED:
beat_task_templates.extend(
[
{
"name": "check-for-doc-permissions-sync",
"task": OnyxCeleryTask.CHECK_FOR_DOC_PERMISSIONS_SYNC,
"schedule": timedelta(seconds=30),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
},
{
"name": "check-for-kg-processing-clustering-only",
"task": OnyxCeleryTask.CHECK_KG_PROCESSING_CLUSTERING_ONLY,
"schedule": timedelta(seconds=600),
"options": {
"priority": OnyxCeleryPriority.LOW,
"expires": BEAT_EXPIRES_DEFAULT,
{
"name": "check-for-external-group-sync",
"task": OnyxCeleryTask.CHECK_FOR_EXTERNAL_GROUP_SYNC,
"schedule": timedelta(seconds=20),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
},
{
"name": "check-for-indexing",
"task": OnyxCeleryTask.CHECK_FOR_INDEXING,
"schedule": timedelta(seconds=15),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-for-checkpoint-cleanup",
"task": OnyxCeleryTask.CHECK_FOR_CHECKPOINT_CLEANUP,
"schedule": timedelta(hours=1),
"options": {
"priority": OnyxCeleryPriority.LOW,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-for-connector-deletion",
"task": OnyxCeleryTask.CHECK_FOR_CONNECTOR_DELETION,
"schedule": timedelta(seconds=20),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-for-vespa-sync",
"task": OnyxCeleryTask.CHECK_FOR_VESPA_SYNC_TASK,
"schedule": timedelta(seconds=20),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-for-user-file-folder-sync",
"task": OnyxCeleryTask.CHECK_FOR_USER_FILE_FOLDER_SYNC,
"schedule": timedelta(
days=1
), # This should essentially always be triggered manually for user folder updates.
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-for-pruning",
"task": OnyxCeleryTask.CHECK_FOR_PRUNING,
"schedule": timedelta(seconds=20),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-for-doc-permissions-sync",
"task": OnyxCeleryTask.CHECK_FOR_DOC_PERMISSIONS_SYNC,
"schedule": timedelta(seconds=30),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "check-for-external-group-sync",
"task": OnyxCeleryTask.CHECK_FOR_EXTERNAL_GROUP_SYNC,
"schedule": timedelta(seconds=20),
"options": {
"priority": OnyxCeleryPriority.MEDIUM,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "monitor-background-processes",
"task": OnyxCeleryTask.MONITOR_BACKGROUND_PROCESSES,
"schedule": timedelta(minutes=5),
"options": {
"priority": OnyxCeleryPriority.LOW,
"expires": BEAT_EXPIRES_DEFAULT,
"queue": OnyxCeleryQueues.MONITORING,
},
},
]
)
]
)
# Only add the LLM model update task if the API URL is configured
if LLM_MODEL_UPDATE_API_URL:
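
This hunk replaces the empty beat_task_templates list with an inline set of core tasks and adds an ENTERPRISE_EDITION_ENABLED branch that extends it with the doc-permission-sync and external-group-sync checks. A minimal sketch of the gating pattern with the entries abbreviated; the OnyxCeleryTask import path is assumed:

from datetime import timedelta

from onyx.configs.app_configs import ENTERPRISE_EDITION_ENABLED
from onyx.configs.constants import OnyxCeleryPriority
from onyx.configs.constants import OnyxCeleryTask  # import path assumed

beat_task_templates: list[dict] = [
    {
        "name": "check-for-indexing",
        "task": OnyxCeleryTask.CHECK_FOR_INDEXING,
        "schedule": timedelta(seconds=15),
        "options": {"priority": OnyxCeleryPriority.MEDIUM},
    },
    # ... remaining core tasks elided ...
]

if ENTERPRISE_EDITION_ENABLED:
    # Permission syncing is an enterprise-only concern, so its beat entries are
    # appended behind the flag rather than registered unconditionally.
    beat_task_templates.extend(
        [
            {
                "name": "check-for-doc-permissions-sync",
                "task": OnyxCeleryTask.CHECK_FOR_DOC_PERMISSIONS_SYNC,
                "schedule": timedelta(seconds=30),
                "options": {"priority": OnyxCeleryPriority.MEDIUM},
            },
            {
                "name": "check-for-external-group-sync",
                "task": OnyxCeleryTask.CHECK_FOR_EXTERNAL_GROUP_SYNC,
                "schedule": timedelta(seconds=20),
                "options": {"priority": OnyxCeleryPriority.MEDIUM},
            },
        ]
    )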

View File

@@ -38,7 +38,7 @@ from onyx.db.document import (
)
from onyx.db.document import get_document_ids_for_connector_credential_pair
from onyx.db.document_set import delete_document_set_cc_pair_relationship__no_commit
from onyx.db.engine import get_session_with_current_tenant
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.enums import ConnectorCredentialPairStatus
from onyx.db.enums import SyncStatus
from onyx.db.enums import SyncType

Some files were not shown because too many files have changed in this diff.