Compare commits


79 Commits

Author SHA1 Message Date
Nik
c11d5fb62a fix(chat): conditional fade overlay and opacity for non-preferred panels
- Only show bottom gradient on non-preferred panels that actually
  overflow the preferred panel height cap
- Always dim non-preferred panels at 50% opacity (not just when capped)
- Add preferredIdx !== -1 guard and responses.find() for safer lookups
2026-04-02 12:06:40 -07:00
Nik
005c27975f fix(chat): address review comments on multi-model panels
- Use responses.find() instead of responses[modelIndex] array indexing
- Guard preferredIdx === -1 to prevent NaN in carousel transform
- Remove unused SvgEyeClosed import
- Remove unused isHighlighted field from MultiModelResponse
- Remove unused modelIndex prop from MultiModelPanel
2026-04-02 12:06:40 -07:00
Nik
c7c689f6cc fix(chat): use pixel values for carousel centering transform
translateX(calc(50% - ...)) used 50% of the track's own width, not
the container's. Use measured trackContainerW for correct centering.
2026-04-02 12:06:40 -07:00
Nik
c9f124b109 docs(chat): add JSDoc to MultiModelResponseView and MultiModelPanel 2026-04-02 12:06:40 -07:00
Nik
f6a821b66d feat(chat): add multi-model response panels
MultiModelResponseView renders 2-3 LLM responses side-by-side with
generation and selection modes. MultiModelPanel wraps AgentMessage for
each model's output. Adds hideFooter prop to AgentMessage for
non-preferred panel styling.
2026-04-02 12:06:40 -07:00
Nik
7272a1b4c4 refactor(LLMPopover): use ModelListContent for model list rendering
LLMPopover now delegates search, grouping, and model list rendering
to ModelListContent, eliminating duplicated code. Keeps popover
wrapper, trigger button, and temperature slider.
2026-04-02 12:06:24 -07:00
Nik
503fab60dc refactor(ModelListContent): replace raw HTML with library components
Replace raw <button> and wrapper <div>s with LineItem and Section
from the component library.
2026-04-02 11:12:57 -07:00
Nik
4b174712cc refactor(ModelListContent): replace legacy accordion with Collapsible
Switch from @/components/ui/accordion (legacy) to
@/refresh-components/Collapsible (Radix wrapper). Matches the visual
layout of the existing LLMPopover: collapsible groups with provider
icons, chevron indicators, auto-expand selected group, force-expand
all on search.
2026-04-02 10:37:36 -07:00
Nik
327b0ff4a1 chore(hooks): remove useMultiModelChat unit test
Hook logic is simple enough to not warrant a standalone test file.
2026-04-02 09:43:43 -07:00
Nik
5a8b6f0217 fix(chat): anchor popover above clicked pill via virtualRef
Replace static Popover.Anchor with virtualRef so the model list
positions above whichever button was clicked (+ button or pill).
Also add avoidCollisions={false} to prevent Radix from flipping
the popover to the bottom.
2026-04-02 03:20:43 -07:00
Nik
9b2f8c8016 fix(chat): refactor ModelSelector separators and click zones
- Replace raw div separators with Separator component
- Change size from md to lg per design mock
- Add two-click-zone: main area opens replace popover, X icon removes
2026-04-02 02:56:22 -07:00
Nik
ec5eb75722 fix(chat): address review comments on model selector and hook
- Fix model_provider mapping: use m.provider instead of m.name
- Pass isLoading to ModelListContent to show loading state
- Cap restoreFromModelNames at MAX_MODELS
- Remove redundant llmManager from useEffect deps
- Use Separator component instead of raw divs
- Add two-click-zone on SelectButton (main area→replace, X→remove)
2026-04-02 02:55:44 -07:00
Nik
3ca304d228 fix(chat): move X icon inside SelectButton, use size md
Match Figma mock: rightIcon={SvgX} puts the remove icon inside the
pill. Size md makes pills larger. Clicking the pill removes the model.
2026-04-02 02:17:06 -07:00
Nik
a5c982fb75 fix(chat): anchor popover to pills, fix separator when at max models
Popover flew to upper-left at 3 models because no Trigger existed.
Wrap pills in Popover.Anchor so content positions correctly. Leading
separator now only shown when + button is visible.
2026-04-02 02:10:21 -07:00
Nik
2d3c838707 refactor(chat): extract ModelListContent, drop accordion for flat section headers
ModelSelector was reimplementing LLMPopover internals (search, accordion,
item rendering). Extract shared ModelListContent component that uses the
ResourcePopover pattern — flat section headers with provider icon + separator
line instead of collapsible accordions. ModelSelector now just handles
selection logic and pill rendering.
2026-04-02 02:02:10 -07:00
Nik
d936efeffd refactor(chat): reuse opal components in ModelSelector
Replace hand-rolled ModelPill with opal SelectButton, custom ModelItem
with LineItem, raw Radix AccordionPrimitive with shared Accordion, and
reuse buildLlmOptions/groupLlmOptions from LLMPopover.
2026-04-02 02:02:10 -07:00
Nik
e1e7693870 feat(chat): add multi-model selector and chat hook
ModelSelector component with add/remove/replace semantics for up to 3
concurrent LLMs. useMultiModelChat hook manages model selection state
and streaming coordination.
2026-04-02 02:02:10 -07:00
Nikolas Garza
df14bbe0e2 feat(chat): add frontend types and API helpers for multi-model streaming (#9648) 2026-04-02 08:52:21 +00:00
Nikolas Garza
3db1ad82ce feat(chat): add multi-model parallel streaming backend (#9647) 2026-04-02 08:03:54 +00:00
Yuhong Sun
1e7882529c docs: Chat README (#9853) 2026-04-02 00:47:31 -07:00
Raunak Bhagat
5d405cfa2d refactor: Hook Extensions edits (#9831) 2026-04-02 06:55:59 +00:00
Nikolas Garza
de3a253ea9 refactor(emitter): replace bus-polling with merge-queue (#9803) 2026-04-02 06:34:12 +00:00
Justin Tahara
d6946a66a5 fix(llm): Azure custom model support + Mistral tool call message ordering (#9729) 2026-04-02 06:22:13 +00:00
Raunak Bhagat
11835a0268 fix(opal): guard opal/interactive's onClick handlers against React portal event bubbling (#9850) 2026-04-02 05:37:00 +00:00
Danelegend
519fb61cc7 fix(xlsx): Improve empty row/col handling (#9288) 2026-04-02 02:52:20 +00:00
acaprau
02671937fb chore(opensearch): Increase DEFAULT_NUM_HYBRID_SUBQUERY_CANDIDATES to 500, disable profiling by default (#9844)
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2026-04-02 00:49:21 +00:00
acaprau
1466158c1e fix(opensearch): Add Vespa server-side timeout for the migration (#9843) 2026-04-02 00:20:59 +00:00
Bo-Onyx
073cf11c42 feat(hook): update hook doc link and reference (#9841) 2026-04-02 00:12:04 +00:00
Justin Tahara
a2b0c15027 fix(db): remove unnecessary selectinload(User.memories) from auth paths (#9838) 2026-04-01 23:51:06 +00:00
Raunak Bhagat
a462678ddd refactor(opal): split SelectCard's sizeVariant prop into paddingVariant + roundingVariant (#9830) 2026-04-01 23:15:20 +00:00
Bo-Onyx
c50d2739b8 feat(hook): integrate document ingestion hook point (#9810) 2026-04-01 23:08:12 +00:00
dependabot[bot]
0214c64cab chore(deps): bump aiohttp from 3.13.3 to 3.13.4 (#9839)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jamison Lahman <jamison@lahman.dev>
2026-04-01 22:48:37 +00:00
Evan Lohn
d09dc6a6f1 refactor: drive connector (#9834) 2026-04-01 22:40:27 +00:00
Jamison Lahman
79a81f37d5 chore(gha): cleanup connector tests (#9836) 2026-04-01 22:17:43 +00:00
Bo-Onyx
5b8af95007 feat(hook): frontend ee (#9825) 2026-04-01 19:18:18 +00:00
Yuhong Sun
b40935339f README Update (#9833) 2026-04-01 12:17:10 -07:00
Yuhong Sun
4a50bfc7ae docs(readme): README and Contrib (#9829) 2026-04-01 11:53:02 -07:00
dependabot[bot]
4c9135ecdf chore(deps): bump fastmcp from 3.0.2 to 3.2.0 (#9814)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jamison Lahman <jamison@lahman.dev>
2026-04-01 18:21:27 +00:00
Jamison Lahman
fe496da134 chore(deployment): rework trivy job (#9780) 2026-04-01 10:55:45 -07:00
Nikolas Garza
7eb8b335c0 refactor(swr): migrate all inline cache keys to SWR_KEYS registry (#9782) 2026-04-01 17:24:40 +00:00
Jamison Lahman
183e3b5ec3 fix(fe): foldable buttons unfold on tab (#9828) 2026-04-01 16:30:27 +00:00
Jamison Lahman
9c5c42479c fix(a11y): migrate some buttons to Hoverable (#9778) 2026-04-01 15:35:07 +00:00
Jamison Lahman
514f8eedb8 chore(fe): prefer Button w/ href to wrapped Link (#9774) 2026-04-01 15:34:37 +00:00
Raunak Bhagat
eb6bd42c1e refactor(admin): revamp Service Accounts page and AdminListHeader (#9824) 2026-04-01 15:11:01 +00:00
Danelegend
953cc28625 feat(files): Inject file metadata over content for certain files (#9786) 2026-04-01 13:19:11 +00:00
Danelegend
de0f42f6cc refactor(files): Port csv type to tabular (#9785) 2026-04-01 03:37:13 +00:00
Raunak Bhagat
7ecefdc90f refactor(opal): split Card sizeVariant into padding + rounding (#9823) 2026-04-01 03:32:08 +00:00
Danelegend
21fc013893 feat(file-upload): Upload files exceeding tokens but skip indexing (#9751) 2026-04-01 02:14:51 +00:00
Justin Tahara
a1c3a68ba4 fix(perf): optimize chat sessions query to prevent DB cascading failures (#9802) 2026-04-01 01:28:37 +00:00
Evan Lohn
4fb175ae65 fix: install early exit (#9818) 2026-04-01 01:09:05 +00:00
Evan Lohn
800ad326df fix: discord token validation (#9817) 2026-04-01 01:08:38 +00:00
Bo-Onyx
6b920e8a3e feat(hook): refactor under ee (#9776) 2026-04-01 01:07:55 +00:00
Justin Tahara
ef3760796d feat(rds): Adding IO Metrics Alarms (#9789) 2026-04-01 01:07:45 +00:00
Jessica Singh
fa5b90df92 fix(connectors): fix reindex on paused file connectors (#9812) 2026-03-31 23:10:09 +00:00
Evan Lohn
53953ac4fa chore: fix indexing log2 (#9811) 2026-03-31 21:02:54 +00:00
Yuhong Sun
26bb5c990c chore: Rag script for benchmark/regression (#9781) 2026-03-31 20:46:17 +00:00
Evan Lohn
27b4ed301f chore: fix batch logging (#9808) 2026-03-31 20:10:33 +00:00
Jessica Singh
93ec270ccc feat(voice): VAD auto-stop only when auto-send is enabled (#9809) 2026-03-31 19:31:31 +00:00
Raunak Bhagat
9e2d6c8a1d refactor(admin): code-interpreter (#9790) 2026-03-31 19:08:55 +00:00
Nikolas Garza
fc934214d0 perf(swr): add SWR_KEYS registry and skip revalidation for stable hooks (#9695) 2026-03-31 19:07:42 +00:00
Raunak Bhagat
48fc45a0cd refactor(admin): web-search (#9761) 2026-03-31 19:04:18 +00:00
Jessica Singh
009266e53e fix(llm): when multiple providers are same type ensure name is prioritized when default (#9777) 2026-03-31 19:03:38 +00:00
Raunak Bhagat
ffb9df7308 refactor(admin): LLM Config (#9806) 2026-03-31 19:03:17 +00:00
Raunak Bhagat
b0f5e0b8d9 refactor(admin): image-generation (#9769) 2026-03-31 18:13:23 +00:00
acaprau
43aea5d614 chore(opensearch): Add Grafana dashboard for retrieval (#9657)
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2026-03-31 16:56:40 +00:00
Bo-Onyx
593d82f431 feat(hook): hook status and logs (#9770) 2026-03-31 16:10:12 +00:00
Ben Wu
adf5691b5f feat(canvas 2/4): Canvas Connector data fetching (#9386) 2026-03-31 03:07:05 +00:00
Nikolas Garza
c1a8a5bd83 fix(tenants): run migrations on pool tenants before assigning to new users (#9788) 2026-03-31 01:24:01 +00:00
Justin Tahara
8fd486da99 feat(rds): Add Freeable Memory alert (#9787) 2026-03-31 00:59:30 +00:00
Raunak Bhagat
4bda4d3637 refactor: migrate away from cards/Select (#9771) 2026-03-31 00:27:01 +00:00
Justin Tahara
13c25eadad feat(rds): Adding CPU Alerts (#9784) 2026-03-31 00:22:15 +00:00
Justin Tahara
1f244e6388 feat(eks): Adding Cloudwatch logging (#9783) 2026-03-30 23:52:44 +00:00
Nikolas Garza
18b0416d30 feat(sentry): enable frontend source map uploads in cloud CI (#9775) 2026-03-30 23:42:57 +00:00
Nikolas Garza
4bc0bc1efb feat(helm): add Grafana dashboard provisioning (#9725) 2026-03-30 23:42:32 +00:00
Justin Tahara
1555217061 feat(rds): Adding RDS Snapshosts (#9779) 2026-03-30 23:17:08 +00:00
Nikolas Garza
d177a833f0 feat(sentry): add release tracking to backend and frontend (#9773) 2026-03-30 22:35:38 +00:00
Jamison Lahman
086997d3c5 chore(types): fix IconButton size props (#9772) 2026-03-30 21:40:25 +00:00
dependabot[bot]
dccec78397 chore(deps): bump helm/chart-testing-action from b5eebdd9998021f29756c53432f48dab66394810 to 2e2940618cb426dce2999631d543b53cdcfc8527 (#9764)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 14:41:01 -07:00
Jamison Lahman
0123133621 chore(fe): polish Query History table (#9767) 2026-03-30 21:30:13 +00:00
338 changed files with 15169 additions and 7472 deletions

View File

@@ -704,6 +704,9 @@ jobs:
NEXT_PUBLIC_FORGOT_PASSWORD_ENABLED=true
NEXT_PUBLIC_INCLUDE_ERROR_POPUP_SUPPORT_LINK=true
NODE_OPTIONS=--max-old-space-size=8192
SENTRY_RELEASE=${{ github.sha }}
secrets: |
sentry_auth_token=${{ secrets.SENTRY_AUTH_TOKEN }}
cache-from: |
type=registry,ref=${{ env.RUNS_ON_ECR_CACHE }}:cloudweb-cache-amd64
type=registry,ref=${{ env.REGISTRY_IMAGE }}:latest
@@ -786,6 +789,9 @@ jobs:
NEXT_PUBLIC_FORGOT_PASSWORD_ENABLED=true
NEXT_PUBLIC_INCLUDE_ERROR_POPUP_SUPPORT_LINK=true
NODE_OPTIONS=--max-old-space-size=8192
SENTRY_RELEASE=${{ github.sha }}
secrets: |
sentry_auth_token=${{ secrets.SENTRY_AUTH_TOKEN }}
cache-from: |
type=registry,ref=${{ env.RUNS_ON_ECR_CACHE }}:cloudweb-cache-arm64
type=registry,ref=${{ env.REGISTRY_IMAGE }}:latest
@@ -1503,232 +1509,105 @@ jobs:
$(printf '%s\n' "${META_TAGS}" | xargs -I {} echo -t {}) \
$IMAGES
trivy-scan-web:
trivy-scan:
needs:
- determine-builds
- merge-web
if: needs.merge-web.result == 'success'
runs-on:
- runs-on
- runner=2cpu-linux-arm64
- run-id=${{ github.run_id }}-trivy-scan-web
- extras=ecr-cache
timeout-minutes: 90
environment: release
env:
REGISTRY_IMAGE: onyxdotapp/onyx-web-server
steps:
- uses: runs-on/action@cd2b598b0515d39d78c38a02d529db87d2196d1e # ratchet:runs-on/action@v2
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@8df5847569e6427dd6c4fb1cf565c83acfa8afa7
with:
role-to-assume: ${{ secrets.AWS_OIDC_ROLE_ARN }}
aws-region: us-east-2
- name: Get AWS Secrets
uses: aws-actions/aws-secretsmanager-get-secrets@a9a7eb4e2f2871d30dc5b892576fde60a2ecc802
with:
secret-ids: |
DOCKER_USERNAME, deploy/docker-username
DOCKER_TOKEN, deploy/docker-token
parse-json-secrets: true
- name: Run Trivy vulnerability scanner
uses: nick-fields/retry@ce71cc2ab81d554ebbe88c79ab5975992d79ba08 # ratchet:nick-fields/retry@v3
with:
timeout_minutes: 30
max_attempts: 3
retry_wait_seconds: 10
command: |
if [ "${{ needs.determine-builds.outputs.is-test-run }}" == "true" ]; then
SCAN_IMAGE="${{ env.RUNS_ON_ECR_CACHE }}:web-${{ needs.determine-builds.outputs.sanitized-tag }}"
else
SCAN_IMAGE="docker.io/${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}"
fi
docker run --rm -v $HOME/.cache/trivy:/root/.cache/trivy \
-e TRIVY_DB_REPOSITORY="public.ecr.aws/aquasecurity/trivy-db:2" \
-e TRIVY_JAVA_DB_REPOSITORY="public.ecr.aws/aquasecurity/trivy-java-db:1" \
-e TRIVY_USERNAME="${{ env.DOCKER_USERNAME }}" \
-e TRIVY_PASSWORD="${{ env.DOCKER_TOKEN }}" \
aquasec/trivy@sha256:a22415a38938a56c379387a8163fcb0ce38b10ace73e593475d3658d578b2436 \
image \
--skip-version-check \
--timeout 20m \
--severity CRITICAL,HIGH \
${SCAN_IMAGE}
trivy-scan-web-cloud:
needs:
- determine-builds
- merge-web-cloud
if: needs.merge-web-cloud.result == 'success'
runs-on:
- runs-on
- runner=2cpu-linux-arm64
- run-id=${{ github.run_id }}-trivy-scan-web-cloud
- extras=ecr-cache
timeout-minutes: 90
environment: release
env:
REGISTRY_IMAGE: onyxdotapp/onyx-web-server-cloud
steps:
- uses: runs-on/action@cd2b598b0515d39d78c38a02d529db87d2196d1e # ratchet:runs-on/action@v2
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@8df5847569e6427dd6c4fb1cf565c83acfa8afa7
with:
role-to-assume: ${{ secrets.AWS_OIDC_ROLE_ARN }}
aws-region: us-east-2
- name: Get AWS Secrets
uses: aws-actions/aws-secretsmanager-get-secrets@a9a7eb4e2f2871d30dc5b892576fde60a2ecc802
with:
secret-ids: |
DOCKER_USERNAME, deploy/docker-username
DOCKER_TOKEN, deploy/docker-token
parse-json-secrets: true
- name: Run Trivy vulnerability scanner
uses: nick-fields/retry@ce71cc2ab81d554ebbe88c79ab5975992d79ba08 # ratchet:nick-fields/retry@v3
with:
timeout_minutes: 30
max_attempts: 3
retry_wait_seconds: 10
command: |
if [ "${{ needs.determine-builds.outputs.is-test-run }}" == "true" ]; then
SCAN_IMAGE="${{ env.RUNS_ON_ECR_CACHE }}:web-cloud-${{ needs.determine-builds.outputs.sanitized-tag }}"
else
SCAN_IMAGE="docker.io/${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}"
fi
docker run --rm -v $HOME/.cache/trivy:/root/.cache/trivy \
-e TRIVY_DB_REPOSITORY="public.ecr.aws/aquasecurity/trivy-db:2" \
-e TRIVY_JAVA_DB_REPOSITORY="public.ecr.aws/aquasecurity/trivy-java-db:1" \
-e TRIVY_USERNAME="${{ env.DOCKER_USERNAME }}" \
-e TRIVY_PASSWORD="${{ env.DOCKER_TOKEN }}" \
aquasec/trivy@sha256:a22415a38938a56c379387a8163fcb0ce38b10ace73e593475d3658d578b2436 \
image \
--skip-version-check \
--timeout 20m \
--severity CRITICAL,HIGH \
${SCAN_IMAGE}
trivy-scan-backend:
needs:
- determine-builds
- merge-backend
if: needs.merge-backend.result == 'success'
- merge-model-server
if: >-
always() && !cancelled() &&
(needs.merge-web.result == 'success' ||
needs.merge-web-cloud.result == 'success' ||
needs.merge-backend.result == 'success' ||
needs.merge-model-server.result == 'success')
runs-on:
- runs-on
- runner=2cpu-linux-arm64
- run-id=${{ github.run_id }}-trivy-scan-backend
- run-id=${{ github.run_id }}-trivy-scan-${{ matrix.component }}
- extras=ecr-cache
timeout-minutes: 90
environment: release
env:
REGISTRY_IMAGE: ${{ contains(github.ref_name, 'cloud') && 'onyxdotapp/onyx-backend-cloud' || 'onyxdotapp/onyx-backend' }}
permissions:
security-events: write # needed for SARIF uploads
timeout-minutes: 10
strategy:
fail-fast: false
matrix:
include:
- component: web
registry-image: onyxdotapp/onyx-web-server
- component: web-cloud
registry-image: onyxdotapp/onyx-web-server-cloud
- component: backend
registry-image: ${{ contains(github.ref_name, 'cloud') && 'onyxdotapp/onyx-backend-cloud' || 'onyxdotapp/onyx-backend' }}
trivyignore: backend/.trivyignore
- component: model-server
registry-image: ${{ contains(github.ref_name, 'cloud') && 'onyxdotapp/onyx-model-server-cloud' || 'onyxdotapp/onyx-model-server' }}
steps:
- name: Check if this scan should run
id: should-run
run: |
case "$COMPONENT" in
web) RESULT="$MERGE_WEB" ;;
web-cloud) RESULT="$MERGE_WEB_CLOUD" ;;
backend) RESULT="$MERGE_BACKEND" ;;
model-server) RESULT="$MERGE_MODEL_SERVER" ;;
esac
if [ "$RESULT" == "success" ]; then
echo "run=true" >> "$GITHUB_OUTPUT"
else
echo "run=false" >> "$GITHUB_OUTPUT"
fi
env:
COMPONENT: ${{ matrix.component }}
MERGE_WEB: ${{ needs.merge-web.result }}
MERGE_WEB_CLOUD: ${{ needs.merge-web-cloud.result }}
MERGE_BACKEND: ${{ needs.merge-backend.result }}
MERGE_MODEL_SERVER: ${{ needs.merge-model-server.result }}
- uses: runs-on/action@cd2b598b0515d39d78c38a02d529db87d2196d1e # ratchet:runs-on/action@v2
if: steps.should-run.outputs.run == 'true'
- name: Checkout
if: steps.should-run.outputs.run == 'true' && matrix.trivyignore != ''
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # ratchet:actions/checkout@v6
with:
persist-credentials: false
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@8df5847569e6427dd6c4fb1cf565c83acfa8afa7
with:
role-to-assume: ${{ secrets.AWS_OIDC_ROLE_ARN }}
aws-region: us-east-2
- name: Get AWS Secrets
uses: aws-actions/aws-secretsmanager-get-secrets@a9a7eb4e2f2871d30dc5b892576fde60a2ecc802
with:
secret-ids: |
DOCKER_USERNAME, deploy/docker-username
DOCKER_TOKEN, deploy/docker-token
parse-json-secrets: true
- name: Determine scan image
if: steps.should-run.outputs.run == 'true'
id: scan-image
run: |
if [ "$IS_TEST_RUN" == "true" ]; then
echo "image=${RUNS_ON_ECR_CACHE}:${TAG_PREFIX}-${SANITIZED_TAG}" >> "$GITHUB_OUTPUT"
else
echo "image=docker.io/${REGISTRY_IMAGE}:${REF_NAME}" >> "$GITHUB_OUTPUT"
fi
env:
IS_TEST_RUN: ${{ needs.determine-builds.outputs.is-test-run }}
TAG_PREFIX: ${{ matrix.component }}
SANITIZED_TAG: ${{ needs.determine-builds.outputs.sanitized-tag }}
REGISTRY_IMAGE: ${{ matrix.registry-image }}
REF_NAME: ${{ github.ref_name }}
- name: Run Trivy vulnerability scanner
uses: nick-fields/retry@ce71cc2ab81d554ebbe88c79ab5975992d79ba08 # ratchet:nick-fields/retry@v3
if: steps.should-run.outputs.run == 'true'
uses: aquasecurity/trivy-action@57a97c7e7821a5776cebc9bb87c984fa69cba8f1 # ratchet:aquasecurity/trivy-action@v0.35.0
with:
timeout_minutes: 30
max_attempts: 3
retry_wait_seconds: 10
command: |
if [ "${{ needs.determine-builds.outputs.is-test-run }}" == "true" ]; then
SCAN_IMAGE="${{ env.RUNS_ON_ECR_CACHE }}:backend-${{ needs.determine-builds.outputs.sanitized-tag }}"
else
SCAN_IMAGE="docker.io/${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}"
fi
docker run --rm -v $HOME/.cache/trivy:/root/.cache/trivy \
-v ${{ github.workspace }}/backend/.trivyignore:/tmp/.trivyignore:ro \
-e TRIVY_DB_REPOSITORY="public.ecr.aws/aquasecurity/trivy-db:2" \
-e TRIVY_JAVA_DB_REPOSITORY="public.ecr.aws/aquasecurity/trivy-java-db:1" \
-e TRIVY_USERNAME="${{ env.DOCKER_USERNAME }}" \
-e TRIVY_PASSWORD="${{ env.DOCKER_TOKEN }}" \
aquasec/trivy@sha256:a22415a38938a56c379387a8163fcb0ce38b10ace73e593475d3658d578b2436 \
image \
--skip-version-check \
--timeout 20m \
--severity CRITICAL,HIGH \
--ignorefile /tmp/.trivyignore \
${SCAN_IMAGE}
image-ref: ${{ steps.scan-image.outputs.image }}
severity: CRITICAL,HIGH
format: "sarif"
output: "trivy-results.sarif"
trivyignores: ${{ matrix.trivyignore }}
env:
TRIVY_USERNAME: ${{ secrets.DOCKER_USERNAME }}
TRIVY_PASSWORD: ${{ secrets.DOCKER_TOKEN }}
trivy-scan-model-server:
needs:
- determine-builds
- merge-model-server
if: needs.merge-model-server.result == 'success'
runs-on:
- runs-on
- runner=2cpu-linux-arm64
- run-id=${{ github.run_id }}-trivy-scan-model-server
- extras=ecr-cache
timeout-minutes: 90
environment: release
env:
REGISTRY_IMAGE: ${{ contains(github.ref_name, 'cloud') && 'onyxdotapp/onyx-model-server-cloud' || 'onyxdotapp/onyx-model-server' }}
steps:
- uses: runs-on/action@cd2b598b0515d39d78c38a02d529db87d2196d1e # ratchet:runs-on/action@v2
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@8df5847569e6427dd6c4fb1cf565c83acfa8afa7
- name: Upload Trivy scan results to GitHub Security tab
if: steps.should-run.outputs.run == 'true'
uses: github/codeql-action/upload-sarif@ba454b8ab46733eb6145342877cd148270bb77ab
with:
role-to-assume: ${{ secrets.AWS_OIDC_ROLE_ARN }}
aws-region: us-east-2
- name: Get AWS Secrets
uses: aws-actions/aws-secretsmanager-get-secrets@a9a7eb4e2f2871d30dc5b892576fde60a2ecc802
with:
secret-ids: |
DOCKER_USERNAME, deploy/docker-username
DOCKER_TOKEN, deploy/docker-token
parse-json-secrets: true
- name: Run Trivy vulnerability scanner
uses: nick-fields/retry@ce71cc2ab81d554ebbe88c79ab5975992d79ba08 # ratchet:nick-fields/retry@v3
with:
timeout_minutes: 30
max_attempts: 3
retry_wait_seconds: 10
command: |
if [ "${{ needs.determine-builds.outputs.is-test-run }}" == "true" ]; then
SCAN_IMAGE="${{ env.RUNS_ON_ECR_CACHE }}:model-server-${{ needs.determine-builds.outputs.sanitized-tag }}"
else
SCAN_IMAGE="docker.io/${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}"
fi
docker run --rm -v $HOME/.cache/trivy:/root/.cache/trivy \
-e TRIVY_DB_REPOSITORY="public.ecr.aws/aquasecurity/trivy-db:2" \
-e TRIVY_JAVA_DB_REPOSITORY="public.ecr.aws/aquasecurity/trivy-java-db:1" \
-e TRIVY_USERNAME="${{ env.DOCKER_USERNAME }}" \
-e TRIVY_PASSWORD="${{ env.DOCKER_TOKEN }}" \
aquasec/trivy@sha256:a22415a38938a56c379387a8163fcb0ce38b10ace73e593475d3658d578b2436 \
image \
--skip-version-check \
--timeout 20m \
--severity CRITICAL,HIGH \
${SCAN_IMAGE}
sarif_file: "trivy-results.sarif"
notify-slack-on-failure:
needs:

View File

@@ -41,7 +41,7 @@ jobs:
version: v3.19.0
- name: Set up chart-testing
uses: helm/chart-testing-action@b5eebdd9998021f29756c53432f48dab66394810
uses: helm/chart-testing-action@2e2940618cb426dce2999631d543b53cdcfc8527
with:
uv_version: "0.9.9"

View File

@@ -22,132 +22,40 @@ on:
- cron: "0 16 * * *"
permissions:
id-token: write # Required for OIDC-based AWS credential exchange
contents: read
env:
# AWS
AWS_ACCESS_KEY_ID_DAILY_CONNECTOR_TESTS: ${{ secrets.AWS_ACCESS_KEY_ID_DAILY_CONNECTOR_TESTS }}
AWS_SECRET_ACCESS_KEY_DAILY_CONNECTOR_TESTS: ${{ secrets.AWS_SECRET_ACCESS_KEY_DAILY_CONNECTOR_TESTS }}
# Cloudflare R2
PYTHONPATH: ./backend
DISABLE_TELEMETRY: "true"
R2_ACCOUNT_ID_DAILY_CONNECTOR_TESTS: ${{ vars.R2_ACCOUNT_ID_DAILY_CONNECTOR_TESTS }}
R2_ACCESS_KEY_ID_DAILY_CONNECTOR_TESTS: ${{ secrets.R2_ACCESS_KEY_ID_DAILY_CONNECTOR_TESTS }}
R2_SECRET_ACCESS_KEY_DAILY_CONNECTOR_TESTS: ${{ secrets.R2_SECRET_ACCESS_KEY_DAILY_CONNECTOR_TESTS }}
# Google Cloud Storage
GCS_ACCESS_KEY_ID_DAILY_CONNECTOR_TESTS: ${{ secrets.GCS_ACCESS_KEY_ID_DAILY_CONNECTOR_TESTS }}
GCS_SECRET_ACCESS_KEY_DAILY_CONNECTOR_TESTS: ${{ secrets.GCS_SECRET_ACCESS_KEY_DAILY_CONNECTOR_TESTS }}
# Confluence
CONFLUENCE_TEST_SPACE_URL: ${{ vars.CONFLUENCE_TEST_SPACE_URL }}
CONFLUENCE_TEST_SPACE: ${{ vars.CONFLUENCE_TEST_SPACE }}
CONFLUENCE_TEST_PAGE_ID: ${{ secrets.CONFLUENCE_TEST_PAGE_ID }}
CONFLUENCE_USER_NAME: ${{ vars.CONFLUENCE_USER_NAME }}
CONFLUENCE_ACCESS_TOKEN: ${{ secrets.CONFLUENCE_ACCESS_TOKEN }}
CONFLUENCE_ACCESS_TOKEN_SCOPED: ${{ secrets.CONFLUENCE_ACCESS_TOKEN_SCOPED }}
# Jira
JIRA_BASE_URL: ${{ secrets.JIRA_BASE_URL }}
JIRA_USER_EMAIL: ${{ secrets.JIRA_USER_EMAIL }}
JIRA_API_TOKEN: ${{ secrets.JIRA_API_TOKEN }}
JIRA_API_TOKEN_SCOPED: ${{ secrets.JIRA_API_TOKEN_SCOPED }}
# Gong
GONG_ACCESS_KEY: ${{ secrets.GONG_ACCESS_KEY }}
GONG_ACCESS_KEY_SECRET: ${{ secrets.GONG_ACCESS_KEY_SECRET }}
# Google
GOOGLE_DRIVE_SERVICE_ACCOUNT_JSON_STR: ${{ secrets.GOOGLE_DRIVE_SERVICE_ACCOUNT_JSON_STR }}
GOOGLE_DRIVE_OAUTH_CREDENTIALS_JSON_STR_TEST_USER_1: ${{ secrets.GOOGLE_DRIVE_OAUTH_CREDENTIALS_JSON_STR_TEST_USER_1 }}
GOOGLE_DRIVE_OAUTH_CREDENTIALS_JSON_STR: ${{ secrets.GOOGLE_DRIVE_OAUTH_CREDENTIALS_JSON_STR }}
GOOGLE_GMAIL_SERVICE_ACCOUNT_JSON_STR: ${{ secrets.GOOGLE_GMAIL_SERVICE_ACCOUNT_JSON_STR }}
GOOGLE_GMAIL_OAUTH_CREDENTIALS_JSON_STR: ${{ secrets.GOOGLE_GMAIL_OAUTH_CREDENTIALS_JSON_STR }}
# Slab
SLAB_BOT_TOKEN: ${{ secrets.SLAB_BOT_TOKEN }}
# Zendesk
ZENDESK_SUBDOMAIN: ${{ secrets.ZENDESK_SUBDOMAIN }}
ZENDESK_EMAIL: ${{ secrets.ZENDESK_EMAIL }}
ZENDESK_TOKEN: ${{ secrets.ZENDESK_TOKEN }}
# Salesforce
SF_USERNAME: ${{ vars.SF_USERNAME }}
SF_PASSWORD: ${{ secrets.SF_PASSWORD }}
SF_SECURITY_TOKEN: ${{ secrets.SF_SECURITY_TOKEN }}
# Hubspot
HUBSPOT_ACCESS_TOKEN: ${{ secrets.HUBSPOT_ACCESS_TOKEN }}
# IMAP
IMAP_HOST: ${{ vars.IMAP_HOST }}
IMAP_USERNAME: ${{ vars.IMAP_USERNAME }}
IMAP_PASSWORD: ${{ secrets.IMAP_PASSWORD }}
IMAP_MAILBOXES: ${{ vars.IMAP_MAILBOXES }}
# Airtable
AIRTABLE_TEST_BASE_ID: ${{ vars.AIRTABLE_TEST_BASE_ID }}
AIRTABLE_TEST_TABLE_ID: ${{ vars.AIRTABLE_TEST_TABLE_ID }}
AIRTABLE_TEST_TABLE_NAME: ${{ vars.AIRTABLE_TEST_TABLE_NAME }}
AIRTABLE_ACCESS_TOKEN: ${{ secrets.AIRTABLE_ACCESS_TOKEN }}
# Sharepoint
SHAREPOINT_CLIENT_ID: ${{ vars.SHAREPOINT_CLIENT_ID }}
SHAREPOINT_CLIENT_SECRET: ${{ secrets.SHAREPOINT_CLIENT_SECRET }}
SHAREPOINT_CLIENT_DIRECTORY_ID: ${{ vars.SHAREPOINT_CLIENT_DIRECTORY_ID }}
SHAREPOINT_SITE: ${{ vars.SHAREPOINT_SITE }}
PERM_SYNC_SHAREPOINT_CLIENT_ID: ${{ secrets.PERM_SYNC_SHAREPOINT_CLIENT_ID }}
PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ${{ secrets.PERM_SYNC_SHAREPOINT_PRIVATE_KEY }}
PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ${{ secrets.PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD }}
PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ${{ secrets.PERM_SYNC_SHAREPOINT_DIRECTORY_ID }}
# Github
ACCESS_TOKEN_GITHUB: ${{ secrets.ACCESS_TOKEN_GITHUB }}
# Gitlab
GITLAB_ACCESS_TOKEN: ${{ secrets.GITLAB_ACCESS_TOKEN }}
# Gitbook
GITBOOK_SPACE_ID: ${{ secrets.GITBOOK_SPACE_ID }}
GITBOOK_API_KEY: ${{ secrets.GITBOOK_API_KEY }}
# Notion
NOTION_INTEGRATION_TOKEN: ${{ secrets.NOTION_INTEGRATION_TOKEN }}
# Highspot
HIGHSPOT_KEY: ${{ secrets.HIGHSPOT_KEY }}
HIGHSPOT_SECRET: ${{ secrets.HIGHSPOT_SECRET }}
# Slack
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
# Discord
DISCORD_CONNECTOR_BOT_TOKEN: ${{ secrets.DISCORD_CONNECTOR_BOT_TOKEN }}
# Teams
TEAMS_APPLICATION_ID: ${{ secrets.TEAMS_APPLICATION_ID }}
TEAMS_DIRECTORY_ID: ${{ secrets.TEAMS_DIRECTORY_ID }}
TEAMS_SECRET: ${{ secrets.TEAMS_SECRET }}
# Bitbucket
BITBUCKET_WORKSPACE: ${{ secrets.BITBUCKET_WORKSPACE }}
BITBUCKET_REPOSITORIES: ${{ secrets.BITBUCKET_REPOSITORIES }}
BITBUCKET_PROJECTS: ${{ secrets.BITBUCKET_PROJECTS }}
BITBUCKET_EMAIL: ${{ vars.BITBUCKET_EMAIL }}
BITBUCKET_API_TOKEN: ${{ secrets.BITBUCKET_API_TOKEN }}
# Fireflies
FIREFLIES_API_KEY: ${{ secrets.FIREFLIES_API_KEY }}
jobs:
connectors-check:
# See https://runs-on.com/runners/linux/
runs-on: [runs-on, runner=8cpu-linux-x64, "run-id=${{ github.run_id }}-connectors-check", "extras=s3-cache"]
runs-on:
[
runs-on,
runner=8cpu-linux-x64,
"run-id=${{ github.run_id }}-connectors-check",
"extras=s3-cache",
]
timeout-minutes: 45
env:
PYTHONPATH: ./backend
DISABLE_TELEMETRY: "true"
environment: ci-protected
steps:
- uses: runs-on/action@cd2b598b0515d39d78c38a02d529db87d2196d1e # ratchet:runs-on/action@v2
@@ -188,6 +96,66 @@ jobs:
- 'backend/onyx/file_processing/**'
- 'uv.lock'
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@8df5847569e6427dd6c4fb1cf565c83acfa8afa7 # ratchet:aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_OIDC_ROLE_ARN }}
aws-region: us-east-2
- name: Get connector test secrets from AWS Secrets Manager
uses: aws-actions/aws-secretsmanager-get-secrets@a9a7eb4e2f2871d30dc5b892576fde60a2ecc802 # ratchet:aws-actions/aws-secretsmanager-get-secrets@v2
with:
parse-json-secrets: false
secret-ids: |
AWS_ACCESS_KEY_ID_DAILY_CONNECTOR_TESTS, test/aws-access-key-id
AWS_SECRET_ACCESS_KEY_DAILY_CONNECTOR_TESTS, test/aws-secret-access-key
R2_ACCESS_KEY_ID_DAILY_CONNECTOR_TESTS, test/r2-access-key-id
R2_SECRET_ACCESS_KEY_DAILY_CONNECTOR_TESTS, test/r2-secret-access-key
GCS_ACCESS_KEY_ID_DAILY_CONNECTOR_TESTS, test/gcs-access-key-id
GCS_SECRET_ACCESS_KEY_DAILY_CONNECTOR_TESTS, test/gcs-secret-access-key
CONFLUENCE_ACCESS_TOKEN, test/confluence-access-token
CONFLUENCE_ACCESS_TOKEN_SCOPED, test/confluence-access-token-scoped
JIRA_BASE_URL, test/jira-base-url
JIRA_USER_EMAIL, test/jira-user-email
JIRA_API_TOKEN, test/jira-api-token
JIRA_API_TOKEN_SCOPED, test/jira-api-token-scoped
GONG_ACCESS_KEY, test/gong-access-key
GONG_ACCESS_KEY_SECRET, test/gong-access-key-secret
GOOGLE_DRIVE_SERVICE_ACCOUNT_JSON_STR, test/google-drive-service-account-json
GOOGLE_DRIVE_OAUTH_CREDENTIALS_JSON_STR_TEST_USER_1, test/google-drive-oauth-creds-test-user-1
GOOGLE_DRIVE_OAUTH_CREDENTIALS_JSON_STR, test/google-drive-oauth-creds
GOOGLE_GMAIL_SERVICE_ACCOUNT_JSON_STR, test/google-gmail-service-account-json
GOOGLE_GMAIL_OAUTH_CREDENTIALS_JSON_STR, test/google-gmail-oauth-creds
SLAB_BOT_TOKEN, test/slab-bot-token
ZENDESK_SUBDOMAIN, test/zendesk-subdomain
ZENDESK_EMAIL, test/zendesk-email
ZENDESK_TOKEN, test/zendesk-token
SF_PASSWORD, test/sf-password
SF_SECURITY_TOKEN, test/sf-security-token
HUBSPOT_ACCESS_TOKEN, test/hubspot-access-token
IMAP_PASSWORD, test/imap-password
AIRTABLE_ACCESS_TOKEN, test/airtable-access-token
SHAREPOINT_CLIENT_SECRET, test/sharepoint-client-secret
PERM_SYNC_SHAREPOINT_CLIENT_ID, test/perm-sync-sharepoint-client-id
PERM_SYNC_SHAREPOINT_PRIVATE_KEY, test/perm-sync-sharepoint-private-key
PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD, test/perm-sync-sharepoint-cert-password
PERM_SYNC_SHAREPOINT_DIRECTORY_ID, test/perm-sync-sharepoint-directory-id
ACCESS_TOKEN_GITHUB, test/github-access-token
GITLAB_ACCESS_TOKEN, test/gitlab-access-token
GITBOOK_SPACE_ID, test/gitbook-space-id
GITBOOK_API_KEY, test/gitbook-api-key
NOTION_INTEGRATION_TOKEN, test/notion-integration-token
HIGHSPOT_KEY, test/highspot-key
HIGHSPOT_SECRET, test/highspot-secret
SLACK_BOT_TOKEN, test/slack-bot-token
DISCORD_CONNECTOR_BOT_TOKEN, test/discord-bot-token
TEAMS_APPLICATION_ID, test/teams-application-id
TEAMS_DIRECTORY_ID, test/teams-directory-id
TEAMS_SECRET, test/teams-secret
BITBUCKET_WORKSPACE, test/bitbucket-workspace
BITBUCKET_API_TOKEN, test/bitbucket-api-token
FIREFLIES_API_KEY, test/fireflies-api-key
- name: Run Tests (excluding HubSpot, Salesforce, GitHub, and Coda)
shell: script -q -e -c "bash --noprofile --norc -eo pipefail {0}"
run: |

View File

@@ -15,7 +15,6 @@ permissions:
jobs:
Deploy-Preview:
runs-on: ubuntu-latest
environment: ci-protected
timeout-minutes: 30
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd

View File

@@ -6,7 +6,7 @@ Use explicit type annotations for variables to enhance code clarity, especially
## Best Practices
Use `contributing_guides/best_practices.md` as core review context. Prefer consistency with existing patterns, fix issues in code you touch, avoid tacking new features onto muddy interfaces, fail loudly instead of silently swallowing errors, keep code strictly typed, preserve clear state boundaries, remove duplicate or dead logic, break up overly long functions, avoid hidden import-time side effects, respect module boundaries, and favor correctness-by-construction over relying on callers to use an API correctly.
Use the "Engineering Best Practices" section of `CONTRIBUTING.md` as core review context. Prefer consistency with existing patterns, fix issues in code you touch, avoid tacking new features onto muddy interfaces, fail loudly instead of silently swallowing errors, keep code strictly typed, preserve clear state boundaries, remove duplicate or dead logic, break up overly long functions, avoid hidden import-time side effects, respect module boundaries, and favor correctness-by-construction over relying on callers to use an API correctly.
## TODOs
@@ -27,6 +27,7 @@ Code changes must consider both multi-tenant and single-tenant deployments. In m
## Nginx Routing — New Backend Routes
Whenever a new backend route is added that does NOT start with `/api`, it must also be explicitly added to ALL nginx configs:
- `deployment/helm/charts/onyx/templates/nginx-conf.yaml` (Helm/k8s)
- `deployment/data/nginx/app.conf.template` (docker-compose dev)
- `deployment/data/nginx/app.conf.template.prod` (docker-compose prod)
@@ -37,3 +38,7 @@ Routes not starting with `/api` are not caught by the existing `^/(api|openapi\.
## Full vs Lite Deployments
Code changes must consider both regular Onyx deployments and Onyx lite deployments. Lite deployments disable the vector DB, Redis, model servers, and background workers by default, use PostgreSQL-backed cache/auth/file storage, and rely on the API server to handle background work. Do not assume those services are available unless the code path is explicitly limited to full deployments.
## SWR Cache Keys — Always Use SWR_KEYS Registry
All `useSWR()` calls and `mutate()` calls in the frontend must reference the centralized `SWR_KEYS` registry in `web/src/lib/swr-keys.ts` instead of inline endpoint strings or local string constants. Never write `useSWR("/api/some/endpoint", ...)` or `mutate("/api/some/endpoint")` — always use the corresponding `SWR_KEYS.someEndpoint` constant. If the endpoint does not yet exist in the registry, add it there first. This applies to all variants of an endpoint (e.g. query-string variants like `?get_editable=true` must also be registered as their own key).

View File

@@ -357,5 +357,5 @@ raise OnyxError(OnyxErrorCode.BAD_GATEWAY, detail, status_code_override=e.respon
## Best Practices
In addition to the other content in this file, best practices for contributing
to the codebase can be found at `contributing_guides/best_practices.md`.
Understand its contents and follow them.
to the codebase can be found in the "Engineering Best Practices" section of
`CONTRIBUTING.md`. Understand its contents and follow them.

View File

@@ -1,32 +1,487 @@
# Contributing to Onyx
Hey there! We are so excited that you're interested in Onyx.
## Table of Contents
- [Contribution Opportunities](#contribution-opportunities)
- [Contribution Process](#contribution-process)
- [Development Setup](#development-setup)
- [Prerequisites](#prerequisites)
- [Backend: Python Requirements](#backend-python-requirements)
- [Frontend: Node Dependencies](#frontend-node-dependencies)
- [Formatting and Linting](#formatting-and-linting)
- [Running the Application](#running-the-application)
- [VSCode Debugger (Recommended)](#vscode-debugger-recommended)
- [Manually Running for Development](#manually-running-for-development)
- [Running in Docker](#running-in-docker)
- [macOS-Specific Notes](#macos-specific-notes)
- [Engineering Best Practices](#engineering-best-practices)
- [Principles and Collaboration](#principles-and-collaboration)
- [Style and Maintainability](#style-and-maintainability)
- [Performance and Correctness](#performance-and-correctness)
- [Repository Conventions](#repository-conventions)
- [Release Process](#release-process)
- [Getting Help](#getting-help)
- [Enterprise Edition Contributions](#enterprise-edition-contributions)
---
## Contribution Opportunities
The [GitHub Issues](https://github.com/onyx-dot-app/onyx/issues) page is a great place to look for and share contribution ideas.
If you have your own feature that you would like to build please create an issue and community members can provide feedback and
thumb it up if they feel a common need.
If you have your own feature that you would like to build, please create an issue and community members can provide feedback and upvote if they feel a common need.
---
## Contributing Code
Please reference the documents in contributing_guides folder to ensure that the code base is kept to a high standard.
1. dev_setup.md (start here): gives you a guide to setting up a local development environment.
2. contribution_process.md: how to ensure you are building valuable features that will get reviewed and merged.
3. best_practices.md: before asking for reviews, ensure your changes meet the repo code quality standards.
## Contribution Process
To contribute, please follow the
["fork and pull request"](https://docs.github.com/en/get-started/quickstart/contributing-to-projects) workflow.
### 1. Get the feature or enhancement approved
Create a GitHub issue and see if there are upvotes. If you feel the feature is sufficiently value-additive and you would like approval to contribute it to the repo, tag [Yuhong](https://github.com/yuhongsun96) to review.
If you do not get a response within a week, feel free to email yuhong@onyx.app and include the issue in the message.
Not all small features and enhancements will be accepted as there is a balance between feature richness and bloat. We strive to provide the best user experience possible so we have to be intentional about what we include in the app.
### 2. Get the design approved
The Onyx team will either provide a design doc and PRD for the feature or request one from you, the contributor. The scope and detail of the design will depend on the individual feature.
### 3. IP attribution for EE contributions
If you are contributing features to Onyx Enterprise Edition, you are required to sign the [IP Assignment Agreement](contributor_ip_assignment/EE_Contributor_IP_Assignment_Agreement.md).
### 4. Review and testing
Your features must pass all tests and all comments must be addressed prior to merging.
### Implicit agreements
If we approve an issue, we are promising you the following:
- Your work will receive timely attention and we will put aside other important items to ensure you are not blocked.
- You will receive necessary coaching on eng quality, system design, etc. to ensure the feature is completed well.
- The Onyx team will pull resources and bandwidth from design, PM, and engineering to ensure that you have all the resources to build the feature to the quality required for merging.
Because this is a large investment from our team, we ask that you:
- Thoroughly read all the requirements of the design docs, engineering best practices, and try to minimize overhead for the Onyx team.
- Complete the feature in a timely manner to reduce context switching and an ongoing resource pull from the Onyx team.
---
## Development Setup
Onyx, being a fully functional app, relies on some external software, specifically:
- [Postgres](https://www.postgresql.org/) (Relational DB)
- [OpenSearch](https://opensearch.org/) (Vector DB/Search Engine)
- [Redis](https://redis.io/) (Cache)
- [MinIO](https://min.io/) (File Store)
- [Nginx](https://nginx.org/) (Not needed for development flows generally)
> **Note:**
> This guide provides instructions to build and run Onyx locally from source with Docker containers providing the above external software.
> We believe this combination is easier for development purposes. If you prefer to use pre-built container images, see [Running in Docker](#running-in-docker) below.
### Prerequisites
- **Python 3.11** — If using a lower version, modifications will have to be made to the code. Higher versions may have library compatibility issues.
- **Docker** — Required for running external services (Postgres, OpenSearch, Redis, MinIO).
- **Node.js v22** — We recommend using [nvm](https://github.com/nvm-sh/nvm) to manage Node installations.
### Backend: Python Requirements
We use [uv](https://docs.astral.sh/uv/) and recommend creating a [virtual environment](https://docs.astral.sh/uv/pip/environments/#using-a-virtual-environment).
```bash
uv venv .venv --python 3.11
source .venv/bin/activate
```
_For Windows, activate the virtual environment using Command Prompt:_
```bash
.venv\Scripts\activate
```
If using PowerShell, the command slightly differs:
```powershell
.venv\Scripts\Activate.ps1
```
Install the required Python dependencies:
```bash
uv sync --all-extras
```
Install Playwright for Python (headless browser required by the Web Connector):
```bash
uv run playwright install
```
### Frontend: Node Dependencies
```bash
nvm install 22 && nvm use 22
node -v # verify your active version
```
Navigate to `onyx/web` and run:
```bash
npm i
```
### Formatting and Linting
#### Backend
Set up pre-commit hooks (black / reorder-python-imports):
```bash
uv run pre-commit install
```
We also use `mypy` for static type checking. Onyx is fully type-annotated, and we want to keep it that way! To run the mypy checks manually:
```bash
uv run mypy . # from onyx/backend
```
#### Frontend
We use `prettier` for formatting. The desired version will be installed via `npm i` from the `onyx/web` directory. To run the formatter:
```bash
npx prettier --write . # from onyx/web
```
Pre-commit will also run prettier automatically on files you've recently touched. If any files get re-formatted, your commit will fail; re-stage your changes and commit again.
---
## Running the Application
### VSCode Debugger (Recommended)
We highly recommend using VSCode's debugger for development.
#### Initial Setup
1. Copy `.vscode/env_template.txt` to `.vscode/.env`
2. Fill in the necessary environment variables in `.vscode/.env`
#### Using the Debugger
Before starting, make sure the Docker Daemon is running.
1. Open the Debug view in VSCode (Cmd+Shift+D on macOS)
2. From the dropdown at the top, select "Clear and Restart External Volumes and Containers" and press the green play button
3. From the dropdown at the top, select "Run All Onyx Services" and press the green play button
4. Navigate to http://localhost:3000 in your browser to start using the app
5. Set breakpoints by clicking to the left of line numbers to help debug while the app is running
6. Use the debug toolbar to step through code, inspect variables, etc.
> **Note:** "Clear and Restart External Volumes and Containers" will reset your Postgres and OpenSearch (relational-db and index). Only run this if you are okay with wiping your data.
**Features:**
- Hot reload is enabled for the web server and API servers
- Python debugging is configured with debugpy
- Environment variables are loaded from `.vscode/.env`
- Console output is organized in the integrated terminal with labeled tabs
### Manually Running for Development
#### Docker containers for external software
You will need Docker installed to run these containers.
Navigate to `onyx/deployment/docker_compose`, then start up Postgres/OpenSearch/Redis/MinIO with:
```bash
docker compose -f docker-compose.yml -f docker-compose.dev.yml up -d index relational_db cache minio
```
(index refers to OpenSearch, relational_db refers to Postgres, and cache refers to Redis)
#### Running Onyx locally
To start the frontend, navigate to `onyx/web` and run:
```bash
npm run dev
```
Next, start the model server which runs the local NLP models. Navigate to `onyx/backend` and run:
```bash
uvicorn model_server.main:app --reload --port 9000
```
_For Windows (for compatibility with both PowerShell and Command Prompt):_
```bash
powershell -Command "uvicorn model_server.main:app --reload --port 9000"
```
The first time running Onyx, you will need to run the DB migrations for Postgres. After the first time, this is no longer required unless the DB models change.
Navigate to `onyx/backend` and with the venv active, run:
```bash
alembic upgrade head
```
Next, start the task queue which orchestrates the background jobs. Still in `onyx/backend`, run:
```bash
python ./scripts/dev_run_background_jobs.py
```
To run the backend API server, navigate back to `onyx/backend` and run:
```bash
AUTH_TYPE=basic uvicorn onyx.main:app --reload --port 8080
```
_For Windows (for compatibility with both PowerShell and Command Prompt):_
```bash
powershell -Command "
$env:AUTH_TYPE='basic'
uvicorn onyx.main:app --reload --port 8080
"
```
> **Note:** If you need finer logging, add the additional environment variable `LOG_LEVEL=DEBUG` to the relevant services.
#### Wrapping up
You should now have 4 servers running:
- Web server
- Backend API
- Model server
- Background jobs
Now, visit http://localhost:3000 in your browser. You should see the Onyx onboarding wizard where you can connect your external LLM provider to Onyx.
You've successfully set up a local Onyx instance!
### Running in Docker
You can run the full Onyx application stack from pre-built images including all external software dependencies.
Navigate to `onyx/deployment/docker_compose` and run:
```bash
docker compose up -d
```
After Docker pulls and starts these containers, navigate to http://localhost:3000 to use Onyx.
If you want to make changes to Onyx and run those changes in Docker, you can also build a local version of the Onyx container images that incorporates your changes:
```bash
docker compose up -d --build
```
---
## macOS-Specific Notes
### Setting up Python
Ensure [Homebrew](https://brew.sh/) is already set up, then install Python 3.11:
```bash
brew install python@3.11
```
Add Python 3.11 to your path by adding the following line to `~/.zshrc`:
```
export PATH="$(brew --prefix)/opt/python@3.11/libexec/bin:$PATH"
```
> **Note:** You will need to open a new terminal for the path change above to take effect.
### Setting up Docker
On macOS, you will need to install [Docker Desktop](https://www.docker.com/products/docker-desktop/) and ensure it is running before continuing with the docker commands.
### Formatting and Linting
macOS will likely require you to remove the quarantine attributes from some of the hooks for them to execute properly. After installing pre-commit, run the following command:
```bash
sudo xattr -r -d com.apple.quarantine ~/.cache/pre-commit
```
---
## Engineering Best Practices
> These are the same practices we adhere to as a team internally; we love to build in the open and to uplevel our community and each other through being transparent.
### Principles and Collaboration
- **Use 1-way vs 2-way doors.** For 2-way doors, move faster and iterate. For 1-way doors, be more deliberate.
- **Consistency > being "right."** Prefer consistent patterns across the codebase. If something is truly bad, fix it everywhere.
- **Fix what you touch (selectively).**
- Don't feel obligated to fix every best-practice issue you notice.
- Don't introduce new bad practices.
- If your change touches code that violates best practices, fix it as part of the change.
- **Don't tack features on.** When adding functionality, restructure logically as needed to avoid muddying interfaces and accumulating tech debt.
### Style and Maintainability
#### Comments and readability
Add clear comments:
- At logical boundaries (e.g., interfaces) so the reader doesn't need to dig 10 layers deeper.
- Wherever assumptions are made or something non-obvious/unexpected is done.
- For complicated flows/functions.
- Wherever it saves time (e.g., nontrivial regex patterns).
#### Errors and exceptions
- **Fail loudly** rather than silently skipping work.
- Example: raise and let exceptions propagate instead of silently dropping a document.
- **Don't overuse `try/except`.**
- Put `try/except` at the correct logical level.
- Do not mask exceptions unless it is clearly appropriate.
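A minimal sketch of the fail-loudly rule, using a hypothetical `index_one` helper:

```python
def index_one(doc: dict) -> None:
    """Hypothetical indexing step; raises on malformed input."""
    if "id" not in doc:
        raise ValueError(f"Document missing id: {doc!r}")

# Bad: swallowing the exception silently drops the document.
def index_documents_bad(docs: list[dict]) -> None:
    for doc in docs:
        try:
            index_one(doc)
        except Exception:
            continue  # failure is invisible to operators; avoid this

# Good: let the exception propagate; the caller decides how to handle it.
def index_documents(docs: list[dict]) -> None:
    for doc in docs:
        index_one(doc)
```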
#### Typing
- Everything should be **as strictly typed as possible**.
- Use `cast` for annoying/loose-typed interfaces (e.g., results of `run_functions_tuples_in_parallel`).
- Only `cast` when the type checker sees `Any` or types are too loose.
- Prefer types that are easy to read.
- Avoid dense types like `dict[tuple[str, str], list[list[float]]]`.
- Prefer domain models, e.g.:
- `EmbeddingModel(provider_name, model_name)` as a Pydantic model
- `dict[EmbeddingModel, list[EmbeddingVector]]`
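A short sketch of the preferred shape (the `EmbeddingVector` alias and the field values are illustrative):

```python
from pydantic import BaseModel

class EmbeddingModel(BaseModel, frozen=True):  # frozen makes it hashable, so it works as a dict key
    provider_name: str
    model_name: str

EmbeddingVector = list[float]  # readable alias for a dense vector

# Reads far better than dict[tuple[str, str], list[list[float]]].
vectors_by_model: dict[EmbeddingModel, list[EmbeddingVector]] = {
    EmbeddingModel(provider_name="openai", model_name="text-embedding-3-small"): [
        [0.1, 0.2, 0.3],
    ],
}
```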
#### State, objects, and boundaries
- Keep **clear logical boundaries** for state containers and objects.
- A **config** object should never contain things like a `db_session` (see the sketch after this list).
- Avoid state containers that are overly nested, or huge + flat (use judgment).
- Prefer **composition and functional style** over inheritance/OOP.
- Prefer **no mutation** unless there's a strong reason.
- State objects should be **intentional and explicit**, ideally nonmutating.
- Use interfaces/objects to create clear separation of responsibility.
- Prefer simplicity when there's no clear gain.
- Avoid overcomplicated mechanisms like semaphores.
- Prefer **hash maps (dicts)** over tree structures unless there's a strong reason.
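A minimal sketch of the config/resource boundary flagged above (the names are hypothetical):

```python
from pydantic import BaseModel
from sqlalchemy.orm import Session

class SearchConfig(BaseModel, frozen=True):
    """Pure, immutable configuration; never holds live resources."""
    max_results: int = 50
    include_archived: bool = False

# The DB session is passed alongside the config, not smuggled inside it.
def run_search(config: SearchConfig, db_session: Session, query: str) -> list[str]:
    ...
```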
#### Naming
- Name variables carefully and intentionally.
- Prefer long, explicit names when undecided.
- Avoid single-character variables except for small, self-contained utilities (or not at all).
- Keep the same object/name consistent through the call stack and within functions when reasonable.
- Good: `for token in tokens:`
- Bad: `for msg in tokens:` (if iterating tokens)
- Function names should bias toward **long + descriptive** for codebase search.
- IntelliSense can miss call sites; search works best with unique names.
#### Correctness by construction
- Prefer self-contained correctness — don't rely on callers to "use it right" if you can make misuse hard.
- Avoid redundancies: if a function takes an arg, it shouldn't also take a state object that contains that same arg (see the sketch after this list).
- No dead code (unless there's a very good reason).
- No commented-out code in main or feature branches (unless there's a very good reason).
- No duplicate logic:
- Don't copy/paste into branches when shared logic can live above the conditional.
- If you're afraid to touch the original, you don't understand it well enough.
- LLMs often create subtle duplicate logic — review carefully and remove it.
- Avoid "nearly identical" objects that confuse when to use which.
- Avoid extremely long functions with chained logic:
- Encapsulate steps into helpers for readability, even if not reused.
- "Pythonic" multi-step expressions are OK in moderation; don't trade clarity for cleverness.
### Performance and Correctness
- Avoid holding resources for extended periods (DB sessions, locks/semaphores).
- Validate objects on creation and right before use.
- Connector code (data to Onyx documents):
- Any in-memory structure that can grow without bound based on input must be periodically size-checked (see the sketch after this list).
- If a connector is OOMing (often shows up as "missing celery tasks"), this is a top thing to check retroactively.
- Async and event loops:
- Never introduce new async/event loop Python code, and where it makes sense, convert existing async code to synchronous.
- Writing async code without 100% understanding the code and having a concrete reason to do so is likely to introduce bugs and not add any meaningful performance gains.
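A minimal sketch of periodic size-checking in a connector loop (the cap and the `flush` helper are hypothetical, not actual Onyx code):

```python
from collections.abc import Iterable

MAX_PENDING_DOCS = 1_000  # hypothetical cap

def flush(docs: list[dict]) -> None:
    """Hypothetical stand-in for indexing a batch of documents."""
    print(f"indexed {len(docs)} documents")

def collect_documents(doc_iterator: Iterable[dict]) -> None:
    pending: list[dict] = []
    for doc in doc_iterator:
        pending.append(doc)
        # Size-check the unbounded buffer on every iteration and flush
        # before it can grow past the cap; unchecked growth is a common
        # cause of connector OOMs ("missing celery tasks").
        if len(pending) >= MAX_PENDING_DOCS:
            flush(pending)
            pending.clear()
    if pending:
        flush(pending)
```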
### Repository Conventions
#### Where code lives
- Pydantic + data models: `models.py` files.
- DB interface functions (excluding lazy loading): `db/` directory.
- LLM prompts: `prompts/` directory, roughly mirroring the code layout that uses them.
- API routes: `server/` directory.
#### Pydantic and modeling
- Prefer **Pydantic** over dataclasses.
- If arbitrary (non-Pydantic) types are absolutely required in a model, use `arbitrary_types_allowed`.
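When a non-Pydantic type genuinely must live inside a model, the escape hatch looks like this (Pydantic v2 syntax; `CustomIndex` is a hypothetical third-party type):

```python
from pydantic import BaseModel, ConfigDict

class CustomIndex:
    """Hypothetical non-Pydantic third-party handle."""

class IndexingContext(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)  # escape hatch
    index: CustomIndex
    batch_size: int = 100
```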
#### Data conventions
- Prefer explicit `None` over sentinel empty strings (usually; depends on intent).
- Prefer explicit identifiers: use string enums instead of integer codes.
- Avoid magic numbers (when one is unavoidable, define a named constant co-located with its usage). **Always avoid magic strings.**
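A quick sketch of the string-enum convention (the enum is illustrative):

```python
from enum import Enum

class ConnectorStatus(str, Enum):
    ACTIVE = "active"
    PAUSED = "paused"
    DELETING = "deleting"

# Self-describing in logs, DB rows, and API payloads, unlike bare integer codes.
status = ConnectorStatus.PAUSED
assert status == "paused"  # a str-enum compares equal to its value
```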
#### Logging
- Log messages where they are created.
- Don't propagate log messages around just to log them elsewhere.
#### Encapsulation
- Don't use private attributes/methods/properties from other classes/modules.
- "Private" is private — respect that boundary.
#### SQLAlchemy guidance
- Lazy loading is often bad at scale, especially across multiple list relationships.
- Be careful when accessing SQLAlchemy object attributes:
- It can help avoid redundant DB queries,
- but it can also fail if accessed outside an active session,
- and lazy loading can add hidden DB dependencies to otherwise "simple" functions.
- Reference: https://www.reddit.com/r/SQLAlchemy/comments/138f248/joinedload_vs_selectinload/
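A minimal SQLAlchemy 2.0 sketch of eager-loading a list relationship (the `User`/`Memory` models are illustrative, not Onyx's actual schema):

```python
from sqlalchemy import ForeignKey, select
from sqlalchemy.orm import (
    DeclarativeBase,
    Mapped,
    Session,
    mapped_column,
    relationship,
    selectinload,
)

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "user"
    id: Mapped[int] = mapped_column(primary_key=True)
    memories: Mapped[list["Memory"]] = relationship()

class Memory(Base):
    __tablename__ = "memory"
    id: Mapped[int] = mapped_column(primary_key=True)
    user_id: Mapped[int] = mapped_column(ForeignKey("user.id"))

def get_users_with_memories(db_session: Session) -> list[User]:
    # selectinload fetches all memories in one extra query instead of one
    # lazy query per user (the N+1 pattern), and the loaded objects stay
    # usable after the session closes because nothing loads lazily later.
    stmt = select(User).options(selectinload(User.memories))
    return list(db_session.scalars(stmt).all())
```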
#### Trunk-based development and feature flags
- **PRs should contain no more than 500 lines of real change.**
- **Merge to main frequently.** Avoid long-lived feature branches — they create merge conflicts and integration pain.
- **Use feature flags for incremental rollout.**
- Large features should be merged in small, shippable increments behind a flag.
- This allows continuous integration without exposing incomplete functionality.
- **Keep flags short-lived.** Once a feature is fully rolled out, remove the flag and dead code paths promptly.
- **Flag at the right level.** Prefer flagging at API/UI entry points rather than deep in business logic.
- **Test both flag states.** Ensure the codebase works correctly with the flag on and off.
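A minimal sketch of flagging at the entry point (the env var and handlers are hypothetical):

```python
import os

def multi_model_chat_enabled() -> bool:
    """Hypothetical flag, read once at the API entry point."""
    return os.environ.get("ENABLE_MULTI_MODEL_CHAT", "false").lower() == "true"

def run_single_model_chat(request: dict) -> dict:
    return {"mode": "single", **request}

def run_multi_model_chat(request: dict) -> dict:
    return {"mode": "multi", **request}

def handle_chat_request(request: dict) -> dict:
    # Flag at the entry point: both paths stay testable, and removing the
    # flag later is a one-line change here rather than a hunt through
    # business logic.
    if multi_model_chat_enabled():
        return run_multi_model_chat(request)
    return run_single_model_chat(request)
```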
#### Miscellaneous
- Any TODOs you add in the code must be accompanied by either the name/username of the owner of that TODO, or an issue number for an issue referencing that piece of work.
- Avoid module-level logic that runs on import, which leads to import-time side effects. Essentially every piece of meaningful logic should exist within some function that has to be explicitly invoked. Acceptable exceptions may include loading environment variables or setting up loggers.
- If you find yourself needing something like this, you may want that logic to exist in a file dedicated for manual execution (contains `if __name__ == "__main__":`) which should not be imported by anything else.
- Do not conflate Python scripts you intend to run from the command line (contains `if __name__ == "__main__":`) with modules you intend to import from elsewhere. If for some unlikely reason they have to be the same file, any logic specific to executing the file (including imports) should be contained in the `if __name__ == "__main__":` block.
- Generally these executable files exist in `backend/scripts/`.
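A sketch of the expected layout for such an executable file (the path and names are hypothetical):

```python
"""Hypothetical executable script living in backend/scripts/."""
import argparse

def backfill_documents(batch_size: int) -> int:
    """Importable, side-effect-free logic; safe for other modules to reuse."""
    print(f"backfilling in batches of {batch_size}")
    return 0

if __name__ == "__main__":
    # Everything specific to command-line execution, including any
    # execution-only imports, stays inside this block.
    parser = argparse.ArgumentParser(description="Backfill documents.")
    parser.add_argument("--batch-size", type=int, default=100)
    args = parser.parse_args()
    raise SystemExit(backfill_documents(args.batch_size))
```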
---
## Release Process
Onyx loosely follows the SemVer versioning standard.
A set of Docker containers will be pushed automatically to DockerHub with every tag.
You can see the containers [here](https://hub.docker.com/search?q=onyx%2F).
---
## Getting Help
## Getting Help 🙋
We have support channels and generally interesting discussions on our [Discord](https://discord.gg/4NA5SbzrWb).
See you there!
---
## Release Process
Onyx loosely follows the SemVer versioning standard.
Major changes are released with a "minor" version bump. Currently we use patch release versions to indicate small feature changes.
A set of Docker containers will be pushed automatically to DockerHub with every tag.
You can see the containers [here](https://hub.docker.com/search?q=onyx%2F).
## Enterprise Edition Contributions
If you are contributing features to Onyx Enterprise Edition (code under any `ee/` directory), you are required to sign the [IP Assignment Agreement](contributor_ip_assignment/EE_Contributor_IP_Assignment_Agreement.md) ([PDF version](contributor_ip_assignment/EE_Contributor_IP_Assignment_Agreement.pdf)).

README.md
View File

@@ -4,8 +4,6 @@
<a href="https://www.onyx.app/?utm_source=onyx_repo&utm_medium=github&utm_campaign=readme"> <img width="50%" src="https://github.com/onyx-dot-app/onyx/blob/logo/OnyxLogoCropped.jpg?raw=true" /></a>
</h2>
<p align="center">Open Source AI Platform</p>
<p align="center">
<a href="https://discord.gg/TDJ59cGV2X" target="_blank">
<img src="https://img.shields.io/badge/discord-join-blue.svg?logo=discord&logoColor=white" alt="Discord" />
@@ -27,82 +25,94 @@
</a>
</p>
# Onyx - The Open Source AI Platform
**[Onyx](https://www.onyx.app/?utm_source=onyx_repo&utm_medium=github&utm_campaign=readme)** is the application layer for LLMs - bringing a feature-rich interface that can be easily hosted by anyone.
Onyx comes loaded with advanced features like Agents, Web Search, RAG, MCP, Deep Research, Connectors to 40+ knowledge sources, and more.
> [!TIP]
> Deploy with a single command:
> ```
> curl -fsSL https://onyx.app/install_onyx.sh | bash
> ```
![Onyx Chat Silent Demo](https://github.com/onyx-dot-app/onyx/releases/download/v3.0.0/Onyx.gif)
---
## ⭐ Features
- **🔍 Agentic RAG:** Get best in class search and answer quality based on hybrid index + AI Agents for information retrieval
- Benchmark to release soon!
- **🔬 Deep Research:** Get in depth reports with a multi-step research flow.
- Top of [leaderboard](https://github.com/onyx-dot-app/onyx_deep_research_bench) as of Feb 2026.
- **🤖 Custom Agents:** Build AI Agents with unique instructions, knowledge, and actions.
- **🌍 Web Search:** Browse the web to get up to date information.
- Supports Serper, Google PSE, Brave, SearXNG, and others.
- Comes with an in house web crawler and support for Firecrawl/Exa.
- **📄 Artifacts:** Generate documents, graphics, and other downloadable artifacts.
- **▶️ Actions & MCP:** Let Onyx agents interact with external applications, comes with flexible Auth options.
- **💻 Code Execution:** Execute code in a sandbox to analyze data, render graphs, or modify files.
- **🎙️ Voice Mode:** Chat with Onyx via text-to-speech and speech-to-text.
- **🎨 Image Generation:** Generate images based on user prompts.
- **👥 Collaboration:** Chat sharing, feedback gathering, user management, usage analytics, and more.
Onyx supports all major LLM providers, both self-hosted (like Ollama, LiteLLM, vLLM, etc.) and proprietary (like Anthropic, OpenAI, Gemini, etc.).
To learn more - check out our [docs](https://docs.onyx.app/welcome?utm_source=onyx_repo&utm_medium=github&utm_campaign=readme)!
---
## 🚀 Deployment
> Onyx supports deployments in Docker, Kubernetes, Helm/Terraform and provides guides for major cloud providers.
> Detailed deployment guides found [here](https://docs.onyx.app/deployment/overview).
Onyx supports two separate deployment options: standard and lite.
#### Onyx Lite
The Lite mode can be thought of as a lightweight Chat UI. It requires fewer resources (under 1GB of memory) and runs a less complex stack.
It is great for users who want to test out Onyx quickly or for teams who are only interested in the Chat UI and Agents functionalities.
#### Standard Onyx
The complete feature set of Onyx, recommended for serious users and larger teams. Additional components not included in Lite mode:
- Vector + Keyword index for RAG.
- Background containers to run job queues and workers for syncing knowledge from connectors.
- AI model inference servers to run deep learning models used during indexing and inference.
- Performance optimizations for large scale use via in memory cache (Redis) and blob store (MinIO).
> [!TIP]
> **To try Onyx for free without deploying, visit [Onyx Cloud](https://cloud.onyx.app/signup?utm_source=onyx_repo&utm_medium=github&utm_campaign=readme)**.
---
## 🚧 Roadmap
To see ongoing and upcoming projects, check out our [roadmap](https://github.com/orgs/onyx-dot-app/projects/2)!
## 🔍 Other Notable Benefits
Onyx is built for teams of all sizes, from individual users to the largest global enterprises:
- 👥 Collaboration: Share chats and agents with other members of your organization.
- 🔐 Single Sign On: SSO via Google OAuth, OIDC, or SAML. Group syncing and user provisioning via SCIM.
- 🛡️ Role Based Access Control: RBAC for sensitive resources like access to agents, actions, etc.
- 📊 Analytics: Usage graphs broken down by teams, LLMs, or agents.
- 🕵️ Query History: Audit usage to ensure safe adoption of AI in your organization.
- 💻 Custom code: Run custom code to remove PII, reject sensitive queries, or to run custom analysis.
- 🎨 Whitelabeling: Customize the look and feel of Onyx with custom naming, icons, banners, and more.
## 📚 Licensing
There are two editions of Onyx:
- Onyx Community Edition (CE) is available freely under the MIT license and covers all of the core features for Chat, RAG, Agents, and Actions.
- Onyx Enterprise Edition (EE) includes extra features that are primarily useful for larger organizations.
For feature details, check out [our website](https://www.onyx.app/pricing?utm_source=onyx_repo&utm_medium=github&utm_campaign=readme).
## 👪 Community
Join our open source community on **[Discord](https://discord.gg/TDJ59cGV2X)**!
## 💡 Contributing
Looking to contribute? Please check out the [Contribution Guide](CONTRIBUTING.md) for more details.


@@ -0,0 +1,54 @@
"""csv to tabular chat file type
Revision ID: 8188861f4e92
Revises: d8cdfee5df80
Create Date: 2026-03-31 19:23:05.753184
"""
from alembic import op
# revision identifiers, used by Alembic.
revision = "8188861f4e92"
down_revision = "d8cdfee5df80"
branch_labels = None
depends_on = None
def upgrade() -> None:
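    # Rewrite each element of the chat_message.files JSONB array, converting
    # entries with type "csv" to type "tabular"; downgrade() reverses this.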
op.execute(
"""
UPDATE chat_message
SET files = (
SELECT jsonb_agg(
CASE
WHEN elem->>'type' = 'csv'
THEN jsonb_set(elem, '{type}', '"tabular"')
ELSE elem
END
)
FROM jsonb_array_elements(files) AS elem
)
WHERE files::text LIKE '%"type": "csv"%'
"""
)
def downgrade() -> None:
op.execute(
"""
UPDATE chat_message
SET files = (
SELECT jsonb_agg(
CASE
WHEN elem->>'type' = 'tabular'
THEN jsonb_set(elem, '{type}', '"csv"')
ELSE elem
END
)
FROM jsonb_array_elements(files) AS elem
)
WHERE files::text LIKE '%"type": "tabular"%'
"""
)


@@ -0,0 +1,55 @@
"""add skipped to userfilestatus
Revision ID: d8cdfee5df80
Revises: 1d78c0ca7853
Create Date: 2026-04-01 10:47:12.593950
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "d8cdfee5df80"
down_revision = "1d78c0ca7853"
branch_labels = None
depends_on = None
TABLE = "user_file"
COLUMN = "status"
CONSTRAINT_NAME = "ck_user_file_status"
OLD_VALUES = ("PROCESSING", "INDEXING", "COMPLETED", "FAILED", "CANCELED", "DELETING")
NEW_VALUES = (
"PROCESSING",
"INDEXING",
"COMPLETED",
"SKIPPED",
"FAILED",
"CANCELED",
"DELETING",
)
def _drop_status_check_constraint() -> None:
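    # The constraint name may vary across environments, so find any check
    # constraint whose definition mentions the status column and drop it by name.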
inspector = sa.inspect(op.get_bind())
for constraint in inspector.get_check_constraints(TABLE):
if COLUMN in constraint.get("sqltext", ""):
constraint_name = constraint["name"]
if constraint_name is not None:
op.drop_constraint(constraint_name, TABLE, type_="check")
def upgrade() -> None:
_drop_status_check_constraint()
in_clause = ", ".join(f"'{v}'" for v in NEW_VALUES)
op.create_check_constraint(CONSTRAINT_NAME, TABLE, f"{COLUMN} IN ({in_clause})")
def downgrade() -> None:
op.execute(f"UPDATE {TABLE} SET {COLUMN} = 'COMPLETED' WHERE {COLUMN} = 'SKIPPED'")
_drop_status_check_constraint()
in_clause = ", ".join(f"'{v}'" for v in OLD_VALUES)
op.create_check_constraint(CONSTRAINT_NAME, TABLE, f"{COLUMN} IN ({in_clause})")


@@ -5,6 +5,7 @@ from onyx.background.celery.apps.primary import celery_app
celery_app.autodiscover_tasks(
app_base.filter_task_modules(
[
"ee.onyx.background.celery.tasks.hooks",
"ee.onyx.background.celery.tasks.doc_permission_syncing",
"ee.onyx.background.celery.tasks.external_group_syncing",
"ee.onyx.background.celery.tasks.cloud",


@@ -55,6 +55,15 @@ ee_tasks_to_schedule: list[dict] = []
if not MULTI_TENANT:
ee_tasks_to_schedule = [
{
"name": "hook-execution-log-cleanup",
"task": OnyxCeleryTask.HOOK_EXECUTION_LOG_CLEANUP_TASK,
"schedule": timedelta(days=1),
"options": {
"priority": OnyxCeleryPriority.LOW,
"expires": BEAT_EXPIRES_DEFAULT,
},
},
{
"name": "autogenerate-usage-report",
"task": OnyxCeleryTask.GENERATE_USAGE_REPORT_TASK,


@@ -13,6 +13,7 @@ from redis.lock import Lock as RedisLock
from ee.onyx.server.tenants.provisioning import setup_tenant
from ee.onyx.server.tenants.schema_management import create_schema_if_not_exists
from ee.onyx.server.tenants.schema_management import get_current_alembic_version
from ee.onyx.server.tenants.schema_management import run_alembic_migrations
from onyx.background.celery.apps.app_base import task_logger
from onyx.configs.app_configs import TARGET_AVAILABLE_TENANTS
from onyx.configs.constants import ONYX_CLOUD_TENANT_ID
@@ -29,9 +30,10 @@ from shared_configs.configs import TENANT_ID_PREFIX
# Each tenant takes ~80s (alembic migrations), so 5 tenants ≈ 7 minutes.
_MAX_TENANTS_PER_RUN = 5
# Time limits sized for worst-case: provisioning up to _MAX_TENANTS_PER_RUN new tenants
# (~90s each) plus migrating up to TARGET_AVAILABLE_TENANTS pool tenants (~90s each).
_TENANT_PROVISIONING_SOFT_TIME_LIMIT = 60 * 20 # 20 minutes
_TENANT_PROVISIONING_TIME_LIMIT = 60 * 25 # 25 minutes
@shared_task(
@@ -91,8 +93,7 @@ def check_available_tenants(self: Task) -> None: # noqa: ARG001
batch_size = min(tenants_to_provision, _MAX_TENANTS_PER_RUN)
if batch_size < tenants_to_provision:
task_logger.info(
f"Capping batch to {batch_size} "
f"(need {tenants_to_provision}, will catch up next cycle)"
f"Capping batch to {batch_size} (need {tenants_to_provision}, will catch up next cycle)"
)
provisioned = 0
@@ -103,12 +104,14 @@ def check_available_tenants(self: Task) -> None: # noqa: ARG001
provisioned += 1
except Exception:
task_logger.exception(
f"Failed to provision tenant {i + 1}/{batch_size}, "
"continuing with remaining tenants"
f"Failed to provision tenant {i + 1}/{batch_size}, continuing with remaining tenants"
)
task_logger.info(f"Provisioning complete: {provisioned}/{batch_size} succeeded")
# Migrate any pool tenants that were provisioned before a new migration was deployed
_migrate_stale_pool_tenants()
except Exception:
task_logger.exception("Error in check_available_tenants task")
@@ -121,6 +124,46 @@ def check_available_tenants(self: Task) -> None: # noqa: ARG001
)
def _migrate_stale_pool_tenants() -> None:
"""
Run alembic upgrade head on all pool tenants. Since alembic upgrade head is
idempotent, tenants already at head are a fast no-op. This ensures pool
tenants are always current so that signup doesn't hit schema mismatches
(e.g. missing columns added after the tenant was pre-provisioned).
"""
with get_session_with_shared_schema() as db_session:
pool_tenants = db_session.query(AvailableTenant).all()
tenant_ids = [t.tenant_id for t in pool_tenants]
if not tenant_ids:
return
task_logger.info(
f"Checking {len(tenant_ids)} pool tenant(s) for pending migrations"
)
for tenant_id in tenant_ids:
try:
run_alembic_migrations(tenant_id)
new_version = get_current_alembic_version(tenant_id)
with get_session_with_shared_schema() as db_session:
tenant = (
db_session.query(AvailableTenant)
.filter_by(tenant_id=tenant_id)
.first()
)
if tenant and tenant.alembic_version != new_version:
task_logger.info(
f"Migrated pool tenant {tenant_id}: {tenant.alembic_version} -> {new_version}"
)
tenant.alembic_version = new_version
db_session.commit()
except Exception:
task_logger.exception(
f"Failed to migrate pool tenant {tenant_id}, skipping"
)
def pre_provision_tenant() -> bool:
"""
Pre-provision a new tenant and store it in the NewAvailableTenant table.


@@ -69,5 +69,7 @@ EE_ONLY_PATH_PREFIXES: frozenset[str] = frozenset(
"/admin/token-rate-limits",
# Evals
"/evals",
# Hook extensions
"/admin/hooks",
}
)


@@ -0,0 +1,385 @@
"""Hook executor — calls a customer's external HTTP endpoint for a given hook point.
Usage (Celery tasks and FastAPI handlers):
result = execute_hook(
db_session=db_session,
hook_point=HookPoint.QUERY_PROCESSING,
payload={"query": "...", "user_email": "...", "chat_session_id": "..."},
response_type=QueryProcessingResponse,
)
if isinstance(result, HookSkipped):
# no active hook configured — continue with original behavior
...
elif isinstance(result, HookSoftFailed):
# hook failed but fail strategy is SOFT — continue with original behavior
...
else:
# result is a validated Pydantic model instance (response_type)
...
is_reachable update policy
--------------------------
``is_reachable`` on the Hook row is updated selectively — only when the outcome
carries meaningful signal about physical reachability:
NetworkError (DNS, connection refused) → False (cannot reach the server)
HTTP 401 / 403 → False (api_key revoked or invalid)
TimeoutException → None (server may be slow, skip write)
Other HTTP errors (4xx / 5xx) → None (server responded, skip write)
Unknown exception → None (no signal, skip write)
Non-JSON / non-dict response → None (server responded, skip write)
Success (2xx, valid dict) → True (confirmed reachable)
None means "leave the current value unchanged" — no DB round-trip is made.
DB session design
-----------------
The executor uses three sessions:
1. Caller's session (db_session) — used only for the hook lookup read. All
needed fields are extracted from the Hook object before the HTTP call, so
the caller's session is not held open during the external HTTP request.
2. Log session — a separate short-lived session opened after the HTTP call
completes to write the HookExecutionLog row on failure. Success runs are
not recorded. Committed independently of everything else.
3. Reachable session — a second short-lived session to update is_reachable on
the Hook. Kept separate from the log session so a concurrent hook deletion
(which causes update_hook__no_commit to raise OnyxError(NOT_FOUND)) cannot
prevent the execution log from being written. This update is best-effort.
"""
import json
import time
from typing import Any
from typing import TypeVar
import httpx
from pydantic import BaseModel
from pydantic import ValidationError
from sqlalchemy.orm import Session
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.enums import HookFailStrategy
from onyx.db.enums import HookPoint
from onyx.db.hook import create_hook_execution_log__no_commit
from onyx.db.hook import get_non_deleted_hook_by_hook_point
from onyx.db.hook import update_hook__no_commit
from onyx.db.models import Hook
from onyx.error_handling.error_codes import OnyxErrorCode
from onyx.error_handling.exceptions import OnyxError
from onyx.hooks.executor import HookSkipped
from onyx.hooks.executor import HookSoftFailed
from onyx.utils.logger import setup_logger
from shared_configs.configs import MULTI_TENANT
logger = setup_logger()
T = TypeVar("T", bound=BaseModel)
# ---------------------------------------------------------------------------
# Private helpers
# ---------------------------------------------------------------------------
class _HttpOutcome(BaseModel):
"""Structured result of an HTTP hook call, returned by _process_response."""
is_success: bool
updated_is_reachable: (
bool | None
) # True/False = write to DB, None = unchanged (skip write)
status_code: int | None
error_message: str | None
response_payload: dict[str, Any] | None
def _lookup_hook(
db_session: Session,
hook_point: HookPoint,
) -> Hook | HookSkipped:
"""Return the active Hook or HookSkipped if hooks are unavailable/unconfigured.
No HTTP call is made and no DB writes are performed for any HookSkipped path.
There is nothing to log and no reachability information to update.
"""
if MULTI_TENANT:
return HookSkipped()
hook = get_non_deleted_hook_by_hook_point(
db_session=db_session, hook_point=hook_point
)
if hook is None or not hook.is_active:
return HookSkipped()
if not hook.endpoint_url:
return HookSkipped()
return hook
def _process_response(
*,
response: httpx.Response | None,
exc: Exception | None,
timeout: float,
) -> _HttpOutcome:
"""Process the result of an HTTP call and return a structured outcome.
Called after the client.post() try/except. If post() raised, exc is set and
response is None. Otherwise response is set and exc is None. Handles
raise_for_status(), JSON decoding, and the dict shape check.
"""
if exc is not None:
if isinstance(exc, httpx.NetworkError):
msg = f"Hook network error (endpoint unreachable): {exc}"
logger.warning(msg, exc_info=exc)
return _HttpOutcome(
is_success=False,
updated_is_reachable=False,
status_code=None,
error_message=msg,
response_payload=None,
)
if isinstance(exc, httpx.TimeoutException):
msg = f"Hook timed out after {timeout}s: {exc}"
logger.warning(msg, exc_info=exc)
return _HttpOutcome(
is_success=False,
updated_is_reachable=None, # timeout doesn't indicate unreachability
status_code=None,
error_message=msg,
response_payload=None,
)
msg = f"Hook call failed: {exc}"
logger.exception(msg, exc_info=exc)
return _HttpOutcome(
is_success=False,
updated_is_reachable=None, # unknown error — don't make assumptions
status_code=None,
error_message=msg,
response_payload=None,
)
if response is None:
raise ValueError(
"exactly one of response or exc must be non-None; both are None"
)
status_code = response.status_code
try:
response.raise_for_status()
except httpx.HTTPStatusError as e:
msg = f"Hook returned HTTP {e.response.status_code}: {e.response.text}"
logger.warning(msg, exc_info=e)
# 401/403 means the api_key has been revoked or is invalid — mark unreachable
# so the operator knows to update it. All other HTTP errors keep is_reachable
# as-is (server is up, the request just failed for application reasons).
auth_failed = e.response.status_code in (401, 403)
return _HttpOutcome(
is_success=False,
updated_is_reachable=False if auth_failed else None,
status_code=status_code,
error_message=msg,
response_payload=None,
)
try:
response_payload = response.json()
except (json.JSONDecodeError, httpx.DecodingError) as e:
msg = f"Hook returned non-JSON response: {e}"
logger.warning(msg, exc_info=e)
return _HttpOutcome(
is_success=False,
updated_is_reachable=None, # server responded — reachability unchanged
status_code=status_code,
error_message=msg,
response_payload=None,
)
if not isinstance(response_payload, dict):
msg = f"Hook returned non-dict JSON (got {type(response_payload).__name__})"
logger.warning(msg)
return _HttpOutcome(
is_success=False,
updated_is_reachable=None, # server responded — reachability unchanged
status_code=status_code,
error_message=msg,
response_payload=None,
)
return _HttpOutcome(
is_success=True,
updated_is_reachable=True,
status_code=status_code,
error_message=None,
response_payload=response_payload,
)
def _persist_result(
*,
hook_id: int,
outcome: _HttpOutcome,
duration_ms: int,
) -> None:
"""Write the execution log on failure and optionally update is_reachable, each
in its own session so a failure in one does not affect the other."""
# Only write the execution log on failure — success runs are not recorded.
# Must not be skipped if the is_reachable update fails (e.g. hook concurrently
# deleted between the initial lookup and here).
if not outcome.is_success:
try:
with get_session_with_current_tenant() as log_session:
create_hook_execution_log__no_commit(
db_session=log_session,
hook_id=hook_id,
is_success=False,
error_message=outcome.error_message,
status_code=outcome.status_code,
duration_ms=duration_ms,
)
log_session.commit()
except Exception:
logger.exception(
f"Failed to persist hook execution log for hook_id={hook_id}"
)
# Update is_reachable separately — best-effort, non-critical.
# None means the value is unchanged (set by the caller to skip the no-op write).
# update_hook__no_commit can raise OnyxError(NOT_FOUND) if the hook was
# concurrently deleted, so keep this isolated from the log write above.
if outcome.updated_is_reachable is not None:
try:
with get_session_with_current_tenant() as reachable_session:
update_hook__no_commit(
db_session=reachable_session,
hook_id=hook_id,
is_reachable=outcome.updated_is_reachable,
)
reachable_session.commit()
except Exception:
logger.warning(f"Failed to update is_reachable for hook_id={hook_id}")
# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------
def _execute_hook_inner(
hook: Hook,
payload: dict[str, Any],
response_type: type[T],
) -> T | HookSoftFailed:
"""Make the HTTP call, validate the response, and return a typed model.
Raises OnyxError on HARD failure. Returns HookSoftFailed on SOFT failure.
"""
timeout = hook.timeout_seconds
hook_id = hook.id
fail_strategy = hook.fail_strategy
endpoint_url = hook.endpoint_url
current_is_reachable: bool | None = hook.is_reachable
if not endpoint_url:
raise ValueError(
f"hook_id={hook_id} is active but has no endpoint_url — "
"active hooks without an endpoint_url must be rejected by _lookup_hook"
)
start = time.monotonic()
response: httpx.Response | None = None
exc: Exception | None = None
try:
api_key: str | None = (
hook.api_key.get_value(apply_mask=False) if hook.api_key else None
)
headers: dict[str, str] = {"Content-Type": "application/json"}
if api_key:
headers["Authorization"] = f"Bearer {api_key}"
with httpx.Client(
timeout=timeout, follow_redirects=False
) as client: # SSRF guard: never follow redirects
response = client.post(endpoint_url, json=payload, headers=headers)
except Exception as e:
exc = e
duration_ms = int((time.monotonic() - start) * 1000)
outcome = _process_response(response=response, exc=exc, timeout=timeout)
# Validate the response payload against response_type.
# A validation failure downgrades the outcome to a failure so it is logged,
# is_reachable is left unchanged (server responded — just a bad payload),
# and fail_strategy is respected below.
validated_model: T | None = None
if outcome.is_success and outcome.response_payload is not None:
try:
validated_model = response_type.model_validate(outcome.response_payload)
except ValidationError as e:
msg = (
f"Hook response failed validation against {response_type.__name__}: {e}"
)
outcome = _HttpOutcome(
is_success=False,
updated_is_reachable=None, # server responded — reachability unchanged
status_code=outcome.status_code,
error_message=msg,
response_payload=None,
)
# Skip the is_reachable write when the value would not change — avoids a
# no-op DB round-trip on every call when the hook is already in the expected state.
if outcome.updated_is_reachable == current_is_reachable:
outcome = outcome.model_copy(update={"updated_is_reachable": None})
_persist_result(hook_id=hook_id, outcome=outcome, duration_ms=duration_ms)
if not outcome.is_success:
if fail_strategy == HookFailStrategy.HARD:
raise OnyxError(
OnyxErrorCode.HOOK_EXECUTION_FAILED,
outcome.error_message or "Hook execution failed.",
)
logger.warning(
f"Hook execution failed (soft fail) for hook_id={hook_id}: {outcome.error_message}"
)
return HookSoftFailed()
if validated_model is None:
raise OnyxError(
OnyxErrorCode.INTERNAL_ERROR,
f"validated_model is None for successful hook call (hook_id={hook_id})",
)
return validated_model
def _execute_hook_impl(
*,
db_session: Session,
hook_point: HookPoint,
payload: dict[str, Any],
response_type: type[T],
) -> T | HookSkipped | HookSoftFailed:
"""EE implementation — loaded by CE's execute_hook via fetch_versioned_implementation.
Returns HookSkipped if no active hook is configured, HookSoftFailed if the
hook failed with SOFT fail strategy, or a validated response model on success.
Raises OnyxError on HARD failure or if the hook is misconfigured.
"""
hook = _lookup_hook(db_session, hook_point)
if isinstance(hook, HookSkipped):
return hook
fail_strategy = hook.fail_strategy
hook_id = hook.id
try:
return _execute_hook_inner(hook, payload, response_type)
except Exception:
if fail_strategy == HookFailStrategy.SOFT:
logger.exception(
f"Unexpected error in hook execution (soft fail) for hook_id={hook_id}"
)
return HookSoftFailed()
raise


@@ -15,6 +15,7 @@ from ee.onyx.server.enterprise_settings.api import (
basic_router as enterprise_settings_router,
)
from ee.onyx.server.evals.api import router as evals_router
from ee.onyx.server.features.hooks.api import router as hook_router
from ee.onyx.server.license.api import router as license_router
from ee.onyx.server.manage.standard_answer import router as standard_answer_router
from ee.onyx.server.middleware.license_enforcement import (
@@ -138,6 +139,7 @@ def get_application() -> FastAPI:
include_router_with_global_prefix_prepended(application, ee_oauth_router)
include_router_with_global_prefix_prepended(application, ee_document_cc_pair_router)
include_router_with_global_prefix_prepended(application, evals_router)
include_router_with_global_prefix_prepended(application, hook_router)
# Enterprise-only global settings
include_router_with_global_prefix_prepended(


@@ -99,6 +99,26 @@ async def get_or_provision_tenant(
tenant_id = await get_available_tenant()
if tenant_id:
# Run migrations to ensure the pre-provisioned tenant schema is current.
# Pool tenants may have been created before a new migration was deployed.
# Capture as a non-optional local so mypy can type the lambda correctly.
_tenant_id: str = tenant_id
loop = asyncio.get_running_loop()
try:
await loop.run_in_executor(
None, lambda: run_alembic_migrations(_tenant_id)
)
except Exception:
# The tenant was already dequeued from the pool — roll it back so
# it doesn't end up orphaned (schema exists, but not assigned to anyone).
logger.exception(
f"Migration failed for pre-provisioned tenant {_tenant_id}; rolling back"
)
try:
await rollback_tenant_provisioning(_tenant_id)
except Exception:
logger.exception(f"Failed to rollback orphaned tenant {_tenant_id}")
raise
# If we have a pre-provisioned tenant, assign it to the user
await assign_tenant_to_user(tenant_id, email, referral_source)
logger.info(f"Assigned pre-provisioned tenant {tenant_id} to user {email}")


@@ -100,6 +100,7 @@ def get_model_app() -> FastAPI:
dsn=SENTRY_DSN,
integrations=[StarletteIntegration(), FastApiIntegration()],
traces_sample_rate=0.1,
release=__version__,
)
logger.info("Sentry initialized")
else:


@@ -20,6 +20,7 @@ from sentry_sdk.integrations.celery import CeleryIntegration
from sqlalchemy import text
from sqlalchemy.orm import Session
from onyx import __version__
from onyx.background.celery.apps.task_formatters import CeleryTaskColoredFormatter
from onyx.background.celery.apps.task_formatters import CeleryTaskPlainFormatter
from onyx.background.celery.celery_utils import celery_is_worker_primary
@@ -65,6 +66,7 @@ if SENTRY_DSN:
dsn=SENTRY_DSN,
integrations=[CeleryIntegration()],
traces_sample_rate=0.1,
release=__version__,
)
logger.info("Sentry initialized")
else:
@@ -515,7 +517,8 @@ def reset_tenant_id(
def wait_for_vespa_or_shutdown(
sender: Any, # noqa: ARG001
**kwargs: Any, # noqa: ARG001
) -> None: # noqa: ARG001
"""Waits for Vespa to become ready subject to a timeout.
Raises WorkerShutdown if the timeout is reached."""


@@ -317,7 +317,6 @@ celery_app.autodiscover_tasks(
"onyx.background.celery.tasks.docprocessing",
"onyx.background.celery.tasks.evals",
"onyx.background.celery.tasks.hierarchyfetching",
"onyx.background.celery.tasks.hooks",
"onyx.background.celery.tasks.periodic",
"onyx.background.celery.tasks.pruning",
"onyx.background.celery.tasks.shared",


@@ -14,7 +14,6 @@ from onyx.configs.constants import ONYX_CLOUD_CELERY_TASK_PREFIX
from onyx.configs.constants import OnyxCeleryPriority
from onyx.configs.constants import OnyxCeleryQueues
from onyx.configs.constants import OnyxCeleryTask
from onyx.hooks.utils import HOOKS_AVAILABLE
from shared_configs.configs import MULTI_TENANT
# choosing 15 minutes because it roughly gives us enough time to process many tasks
@@ -362,19 +361,6 @@ if not MULTI_TENANT:
tasks_to_schedule.extend(beat_task_templates)
if HOOKS_AVAILABLE:
tasks_to_schedule.append(
{
"name": "hook-execution-log-cleanup",
"task": OnyxCeleryTask.HOOK_EXECUTION_LOG_CLEANUP_TASK,
"schedule": timedelta(days=1),
"options": {
"priority": OnyxCeleryPriority.LOW,
"expires": BEAT_EXPIRES_DEFAULT,
},
}
)
def generate_cloud_tasks(
beat_tasks: list[dict], beat_templates: list[dict], beat_multiplier: float


@@ -9,6 +9,7 @@ from celery import Celery
from celery import shared_task
from celery import Task
from onyx import __version__
from onyx.background.celery.apps.app_base import task_logger
from onyx.background.celery.memory_monitoring import emit_process_memory
from onyx.background.celery.tasks.docprocessing.heartbeat import start_heartbeat
@@ -137,6 +138,7 @@ def _docfetching_task(
sentry_sdk.init(
dsn=SENTRY_DSN,
traces_sample_rate=0.1,
release=__version__,
)
logger.info("Sentry initialized")
else:


@@ -319,6 +319,11 @@ def monitor_indexing_attempt_progress(
)
current_db_time = get_db_current_time(db_session)
total_batches: int | str = (
coordination_status.total_batches
if coordination_status.total_batches is not None
else "?"
)
if coordination_status.found:
task_logger.info(
f"Indexing attempt progress: "
@@ -326,7 +331,7 @@ def monitor_indexing_attempt_progress(
f"cc_pair={attempt.connector_credential_pair_id} "
f"search_settings={attempt.search_settings_id} "
f"completed_batches={coordination_status.completed_batches} "
f"total_batches={coordination_status.total_batches or '?'} "
f"total_batches={total_batches} "
f"total_docs={coordination_status.total_docs} "
f"total_failures={coordination_status.total_failures}"
f"elapsed={(current_db_time - attempt.time_created).seconds}"
@@ -410,7 +415,7 @@ def check_indexing_completion(
logger.info(
f"Indexing status: "
f"indexing_completed={indexing_completed} "
f"batches_processed={batches_processed}/{batches_total or '?'} "
f"batches_processed={batches_processed}/{batches_total if batches_total is not None else '?'} "
f"total_docs={coordination_status.total_docs} "
f"total_chunks={coordination_status.total_chunks} "
f"total_failures={coordination_status.total_failures}"


@@ -1,25 +1,33 @@
# Overview of Context Management
This document reviews some design decisions around the main agent-loop powering Onyx's chat flow.
It is highly recommended for all engineers contributing to this flow to be familiar with the concepts here.
> Note: it is assumed the reader is familiar with the Onyx product and features such as Projects, User files, Citations, etc.
## System Prompt
The system prompt is a default prompt that comes packaged with the system. Users can edit the default prompt and it will be persisted in the database.
Some parts of the system prompt are dynamically updated / inserted:
- Datetime of the message sent
- Tool descriptions of when to use certain tools, depending on whether each tool is available in that cycle
- If the user has just called a search related tool, then a section about citations is included
## Custom Agent Prompt
The custom agent prompt is inserted as a user message above the most recent user message; it is dynamically moved in the history as the user sends more messages.
If the user has opted to completely replace the System Prompt, then this Custom Agent prompt replaces the system prompt and does not move along the history.
## How Files are handled
On upload, files are processed for tokens; if there are too many tokens to fit in the context, it is considered a failed inclusion. This is done using the LLM tokenizer.
- In many cases, there is not a known tokenizer for each LLM so there is a default tokenizer used as a catchall.
- File upload happens in 2 parts - the actual upload + token counting.
- Files are added into chat context as a “point in time” inclusion and move up the context window as the conversation progresses.
Every file knows how many tokens it is (model agnostic); image files have an assumed number of tokens.
Image files are attached to User Messages also as point in time inclusions.
@@ -27,8 +35,8 @@ Image files are attached to User Messages also as point in time inclusions.
Files selected from the search results are also counted as “point in time” inclusions. Files that are too large cannot be selected.
For these files, the "entire file" does not exist for most connectors, it's pieced back together from the search engine.
## Projects
If a Project contains few enough files that they all fit in the model context, we keep them close enough in the history to ensure they are easy for the LLM to
access. Note that the project documents are assumed to be quite useful, so they should 1. never be dropped from context, and 2. not be reduced to a
needle-in-a-haystack search that relies on a strong keyword to make the LLM attend to them.
@@ -36,11 +44,12 @@ a haystack type search with a strong keyword to make the LLM attend to it.
Project files are vectorized and stored in the Search Engine so that if the user chooses a model with less context than the number of tokens in the project,
the system can RAG over the project files.
## How documents are represented
Documents from search or uploaded Project files are represented as a json so that the LLM can easily understand it. It is represented with a prefix string to
make the context clearer to the LLM. Note that search results (whether web or internal) are just the json, and they arrive as a Tool Call type of message
rather than a user message.
```
Here are some documents provided for context, they may not all be relevant:
{
@@ -50,33 +59,37 @@ Here are some documents provided for context, they may not all be relevant:
]
}
```
Documents are represented with the `document` key so that the LLM can easily cite them with a single number. The tool returns have to be richer to be able to
translate this into links and other UI elements. What the LLM sees is far simpler to reduce noise/hallucinations.
Note that documents included in a single turn should be collapsed into a single user message.
Search tools also give URLs to the LLM so that open_url (a separate tool) can be called on them.
## Reminders
To ensure the LLM follows certain specific instructions, instructions are added at the very end of the chat context as a user message. If a search related
tool is used, a citation reminder is always added. Otherwise, by default there is no reminder. If the user configures reminders, those are added to the
final message. If a search related tool just ran and the user has reminders, both appear in a single message.
If a search related tool is called at any point during the turn, the reminder will remain at the end until the turn is over and the agent has responded.
## Tool Calls
As tool call responses can get very long (like an internal search can be many thousands of tokens), tool responses are currently replaced with a hardcoded
string saying it is no longer available. Tool Call details like the search query and other arguments are kept in the history as this is information
rich and generally very few tokens.
> Note: in the Internal Search flow with query expansion, the Tool Call which was actually run differs from what the LLM provided as arguments.
> What the LLM sees in the history (to be most informative for future calls) is the full set of expanded queries.
**Possible Future Extension**:
Instead of dropping the Tool Call response, we might summarize it using an LLM so that it is just 1-2 sentences and captures the main points. That said,
this is questionable value add because anything relevant and useful should be already captured in the Agent response.
## Examples
```
S -> System Message
CA -> Custom Agent as a User Message
@@ -98,15 +111,15 @@ Flow with Project and File Upload
S, CA, P, F, U1, A1 -- user sends another message -> S, F, U1, A1, CA, P, U2, A2
- File stays in place, above the user message
- Project files move along the chain as new messages are sent
- Custom Agent prompt comes before project files which come before user uploaded files in each turn
Reminders during a single Turn
S, U1, TC, TR, R -- agent calls another tool -> S, U1, TC, TR, TC, TR, R, A1
- Reminder moved to the end
```
## Product considerations
Project files are important to the entire duration of the chat session. If the user has uploaded project files, they are likely very intent on working with
those files. The LLM is much better at referencing documents close to the end of the context window, so we keep them there for ease of access.
@@ -117,9 +130,9 @@ User Message further away. This tradeoff is accepted for Projects because of the
Reminders are absolutely necessary to ensure 1-2 specific instructions get followed with a very high probability. A reminder is less detailed than the system prompt
and should be very targeted for it to work reliably and also not interfere with the last user message.
## Reasons / Experiments
Custom Agent instructions being placed in the system prompt is poorly followed. It also degrades performance of the system especially when the instructions
are orthogonal (or even possibly contradictory) to the system prompt. For weaker models, it causes strange artifacts in tool calls and final responses
that completely ruin the user experience. Empirically, this way works better across a range of models especially when the history gets longer.
Having the Custom Agent instructions not move means they fade as the chat gets long, which is also not ok from a UX perspective.
@@ -146,10 +159,10 @@ In a similar concept, LLM instructions in the system prompt are structured speci
fairly surprising actually but if there is a line of instructions effectively saying "If you try to use some tools and find that you need more information or
need to call additional tools, you are encouraged to do this", having this in the Tool section of the System prompt makes all the LLMs follow it well but if it's
even just a paragraph away like near the beginning of the prompt, it is often ignored. The difference is as drastic as a 30% follow rate to a 90% follow
rate by even just moving the same statement a few sentences.
## Other related pointers
- How messages, files, and images are stored can be found in backend/onyx/db/models.py; there is also a README.md under that directory that may be helpful.
---
@@ -160,32 +173,38 @@ rate even just moving the same statement a few sentences.
Turn: User sends a message and AI does some set of things and responds
Step/Cycle: 1 single LLM inference given some context and some tools
## 1. Top Level (process_message function):
This function can be thought of as the set-up and validation layer. It ensures that the database is in a valid state, reads the
messages in the session and sets up all the necessary items to run the chat loop and state containers. The major things it does
are:
- Validates the request
- Builds the chat history for the session
- Fetches any additional context such as files and images
- Prepares all of the tools for the LLM
- Creates the state container objects for use in the loop
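Schematically, the flow described above might look like this (a simplified sketch, not the actual signatures):
```
def process_message(request, db_session):
    # Set-up and validation layer.
    setup = build_chat_turn(request, db_session)  # validate, history, files, tools, state
    # Execution layer: workers stream packets through the shared queue.
    yield from _run_models(setup)
```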
### Execution (`_run_models` function):
Each model runs in its own worker thread inside a `ThreadPoolExecutor`. Workers write packets to a shared
`merged_queue` via an `Emitter`; the main thread drains the queue and yields packets in arrival order. This
means the top level is isolated from the LLM flow and can yield packets as soon as they are produced. If a
worker fails, the main thread yields a `StreamingError` for that model and keeps the other models running.
All saving and database operations are handled by the main thread after the workers complete (or by the
workers themselves via self-completion if the drain loop exits early).
### Emitter
The emitter is an object that lower levels use to send packets without needing to yield them all the way back
up the call stack. Each `Emitter` tags every packet with a `model_index` and places it on the shared
`merged_queue` as a `(model_idx, packet)` tuple. The drain loop in `_run_models` consumes these tuples and
yields the packets to the caller. Both the emitter and the state container are mutating state objects used
only to accumulate state. There should be no logic dependent on the states of these objects, especially in
the lower levels. The emitter should only take packets and should not be used for other things.
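A minimal sketch of this pattern (illustrative only, not the actual Onyx implementation; each worker is assumed to be a callable that accepts an `Emitter`):
```
import queue
from concurrent.futures import ThreadPoolExecutor

class Emitter:
    """Tags each packet with its model index and puts it on the shared queue."""
    def __init__(self, model_index, merged_queue):
        self.model_index = model_index
        self.merged_queue = merged_queue

    def emit(self, packet):
        self.merged_queue.put((self.model_index, packet))

def run_models(workers, check_is_connected):
    """Drain loop: yield packets in arrival order across all model workers."""
    merged_queue = queue.Queue()
    with ThreadPoolExecutor(max_workers=len(workers)) as executor:
        futures = [
            executor.submit(worker, Emitter(i, merged_queue))
            for i, worker in enumerate(workers)
        ]
        while not (all(f.done() for f in futures) and merged_queue.empty()):
            try:
                # The short timeout doubles as the disconnect polling interval.
                model_idx, packet = merged_queue.get(timeout=0.05)
            except queue.Empty:
                if not check_is_connected():
                    break  # user stopped generation: stop draining
                continue
            yield model_idx, packet
```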
### State Container
The state container is used to accumulate state during the LLM flow. Similar to the emitter, it should not be used for logic,
only for accumulating state. It is used to gather all of the necessary information for saving the chat turn into the database.
So it will accumulate answer tokens, reasoning tokens, tool calls, citation info, etc. This is used at the end of the flow once
@@ -193,35 +212,40 @@ the lower level is completed whether on its own or stopped by the user. At that
the database. The state container can be added to by any of the underlying layers, this is fine.
### Stopping Generation
The drain loop in `_run_models` checks `check_is_connected()` every 50 ms (on queue timeout). The signal itself
is stored in Redis and is set by the user calling the stop endpoint. On disconnect, the drain loop saves
partial state for every model, yields an `OverallStop(stop_reason="user_cancelled")` packet, and returns.
A `drain_done` event signals emitters to stop blocking so worker threads can exit quickly. Workers that
already completed successfully will self-complete (persist their response) if the drain loop exited before
reaching the normal completion path.
## 2. LLM Loop (run_llm_loop function)
This function handles the logic of the Turn. It's essentially a while loop where context is added and modified (according to what
is outlined in the first half of this doc). Its main functionality is:
- Translate and truncate the context for the LLM inference
- Add context modifiers like reminders, updates to the system prompts, etc.
- Run tool calls and gather results
- Build some of the objects stored in the state container.
## 3. LLM Step (run_llm_step function)
This function is a single inference of the LLM. It's a wrapper around the LLM stream function which handles packet translations
so that the Emitter can emit individual tokens as soon as they arrive. It also keeps track of the different sections since they
do not all come at once (reasoning, answers, tool calls are all built up token by token). This layer also tracks the different
tool calls and returns that to the LLM Loop to execute.
## Things to know
- Packets are labeled with a "turn_index" field as part of the Placement of the packet. This is not the same as the backend
concept of a turn. The turn_index for the frontend is which block does this packet belong to. So while a reasoning + tool call
comes from the same LLM inference (same backend LLM step), they are 2 turns to the frontend because that's how it's rendered.
- There are 3 representations of "message". The first is the database model ChatMessage, this one should be translated away and
not used deep into the flow. The second is ChatMessageSimple which is the data model which should be used throughout the code
as much as possible. If modifications/additions are needed, it should be to this object. This is the rich representation of a
message for the code. Finally there is the LanguageModelInput representation of a message. This one is for the LLM interface
layer and is as stripped down as possible so that the LLM interface can be clean and easy to maintain/extend.
- Packets are labeled with a "turn_index" field as part of the Placement of the packet. This is not the same as the backend
concept of a turn. The turn_index for the frontend is which block does this packet belong to. So while a reasoning + tool call
comes from the same LLM inference (same backend LLM step), they are 2 turns to the frontend because that's how it's rendered.
- There are 3 representations of a message, each scoped to a different layer:
1. **ChatMessage** — The database model. Should be converted into ChatMessageSimple early and never passed deep into the flow.
2. **ChatMessageSimple** — The canonical data model used throughout the codebase. This is the rich, full-featured representation
of a message. Any modifications or additions to message structure should be made here.
3. **LanguageModelInput** — The LLM-facing representation. Intentionally minimal so the LLM interface layer stays clean and
easy to maintain/extend.


@@ -1,7 +1,27 @@
import threading
from collections.abc import Callable
from dataclasses import dataclass
from uuid import UUID
from pydantic import BaseModel
from onyx.cache.interface import CacheBackend
from onyx.chat.citation_processor import CitationMapping
from onyx.chat.models import ChatLoadedFile
from onyx.chat.models import ChatMessageSimple
from onyx.chat.models import ExtractedContextFiles
from onyx.chat.models import FileToolMetadata
from onyx.chat.models import SearchParams
from onyx.context.search.models import SearchDoc
from onyx.db.memory import UserMemoryContext
from onyx.db.models import ChatMessage
from onyx.db.models import ChatSession
from onyx.db.models import Persona
from onyx.llm.interfaces import LLM
from onyx.llm.interfaces import LLMUserIdentity
from onyx.onyxbot.slack.models import SlackContext
from onyx.server.query_and_chat.models import SendMessageRequest
from onyx.tools.models import ChatFile
from onyx.tools.models import ToolCallInfo
# Type alias for search doc deduplication key
@@ -148,3 +168,47 @@ class ChatStateContainer:
"""Thread-safe getter for emitted citations (returns a copy)."""
with self._lock:
return self._emitted_citations.copy()
class AvailableFiles(BaseModel):
"""Separated file IDs for the FileReaderTool so it knows which loader to use."""
# IDs from the ``user_file`` table (project / persona-attached files).
user_file_ids: list[UUID] = []
# IDs from the ``file_record`` table (chat-attached files).
chat_file_ids: list[UUID] = []
@dataclass(frozen=True)
class ChatTurnSetup:
"""Immutable context produced by ``build_chat_turn`` and consumed by ``_run_models``."""
new_msg_req: SendMessageRequest
chat_session: ChatSession
persona: Persona
user_message: ChatMessage
user_identity: LLMUserIdentity
llms: list[LLM] # length 1 for single-model, N for multi-model
model_display_names: list[str] # parallel to llms
simple_chat_history: list[ChatMessageSimple]
extracted_context_files: ExtractedContextFiles
reserved_messages: list[ChatMessage] # length 1 for single, N for multi
reserved_token_count: int
search_params: SearchParams
all_injected_file_metadata: dict[str, FileToolMetadata]
available_files: AvailableFiles
tool_id_to_name_map: dict[int, str]
forced_tool_id: int | None
files: list[ChatLoadedFile]
chat_files_for_tools: list[ChatFile]
custom_agent_prompt: str | None
user_memory_context: UserMemoryContext
# For deep research: was the last assistant message a clarification request?
skip_clarification: bool
check_is_connected: Callable[[], bool]
cache: CacheBackend
# Execution params forwarded to per-model tool construction
bypass_acl: bool
slack_context: SlackContext | None
custom_tool_additional_headers: dict[str, str] | None
mcp_headers: dict[str, str] | None


@@ -5,6 +5,7 @@ from typing import cast
from uuid import UUID
from fastapi.datastructures import Headers
from pydantic import BaseModel
from sqlalchemy.orm import Session
from onyx.chat.models import ChatHistoryResult
@@ -51,6 +52,60 @@ logger = setup_logger()
IMAGE_GENERATION_TOOL_NAME = "generate_image"
class FileContextResult(BaseModel):
"""Result of building a file's LLM context representation."""
message: ChatMessageSimple
tool_metadata: FileToolMetadata
def build_file_context(
tool_file_id: str,
filename: str,
file_type: ChatFileType,
content_text: str | None = None,
token_count: int = 0,
approx_char_count: int | None = None,
) -> FileContextResult:
"""Build the LLM context representation for a single file.
Centralises how files should appear in the LLM prompt
— the ID that FileReaderTool accepts (``UserFile.id`` for user files).
"""
if file_type.use_metadata_only():
message_text = (
f"File: {filename} (id={tool_file_id})\n"
"Use the file_reader or python tools to access "
"this file's contents."
)
message = ChatMessageSimple(
message=message_text,
token_count=max(1, len(message_text) // 4),
message_type=MessageType.USER,
file_id=tool_file_id,
)
else:
message_text = f"File: {filename}\n{content_text or ''}\nEnd of File"
message = ChatMessageSimple(
message=message_text,
token_count=token_count,
message_type=MessageType.USER,
file_id=tool_file_id,
)
metadata = FileToolMetadata(
file_id=tool_file_id,
filename=filename,
approx_char_count=(
approx_char_count
if approx_char_count is not None
else len(content_text or "")
),
)
return FileContextResult(message=message, tool_metadata=metadata)
def create_chat_session_from_request(
chat_session_request: ChatSessionCreationRequest,
user_id: UUID | None,
@@ -538,7 +593,7 @@ def convert_chat_history(
for idx, chat_message in enumerate(chat_history):
if chat_message.message_type == MessageType.USER:
# Process files attached to this message
text_files: list[tuple[ChatLoadedFile, FileDescriptor]] = []
image_files: list[ChatLoadedFile] = []
if chat_message.files:
@@ -549,34 +604,26 @@ def convert_chat_history(
if loaded_file.file_type == ChatFileType.IMAGE:
image_files.append(loaded_file)
else:
# Text files (DOC, PLAIN_TEXT, TABULAR) are added as separate messages
text_files.append((loaded_file, file_descriptor))
# Add text files as separate messages before the user message.
# Each message is tagged with ``file_id`` so that forgotten files
# can be detected after context-window truncation.
for text_file, fd in text_files:
# Use user_file_id as the FileReaderTool accepts that.
# Fall back to the file-store path id.
tool_id = fd.get("user_file_id") or text_file.file_id
filename = text_file.filename or "unknown"
ctx = build_file_context(
tool_file_id=tool_id,
filename=filename,
file_type=text_file.file_type,
content_text=text_file.content_text,
token_count=text_file.token_count,
)
simple_messages.append(ctx.message)
all_injected_file_metadata[tool_id] = ctx.tool_metadata
# Sum token counts from image files (excluding project image files)
image_token_count = (

View File

@@ -13,17 +13,17 @@ from collections.abc import Callable
from collections.abc import Generator
from concurrent.futures import ThreadPoolExecutor
from contextvars import Token
from dataclasses import dataclass
from typing import Final
from uuid import UUID
from pydantic import BaseModel
from sqlalchemy.orm import Session
from onyx.cache.factory import get_cache_backend
from onyx.cache.interface import CacheBackend
from onyx.chat.chat_processing_checker import set_processing_status
from onyx.chat.chat_state import AvailableFiles
from onyx.chat.chat_state import ChatStateContainer
from onyx.chat.chat_state import ChatTurnSetup
from onyx.chat.chat_utils import build_file_context
from onyx.chat.chat_utils import convert_chat_history
from onyx.chat.chat_utils import create_chat_history_chain
from onyx.chat.chat_utils import create_chat_session_from_request
@@ -70,9 +70,7 @@ from onyx.db.chat import reserve_multi_model_message_ids
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.enums import HookPoint
from onyx.db.memory import get_memories
from onyx.db.memory import UserMemoryContext
from onyx.db.models import ChatMessage
from onyx.db.models import ChatSession
from onyx.db.models import Persona
from onyx.db.models import User
from onyx.db.models import UserFile
@@ -101,6 +99,7 @@ from onyx.llm.request_context import reset_llm_mock_response
from onyx.llm.request_context import set_llm_mock_response
from onyx.llm.utils import litellm_exception_to_error_msg
from onyx.onyxbot.slack.models import SlackContext
from onyx.server.query_and_chat.chat_utils import mime_type_to_chat_file_type
from onyx.server.query_and_chat.models import AUTO_PLACE_AFTER_LATEST_MESSAGE
from onyx.server.query_and_chat.models import MessageResponseIDInfo
from onyx.server.query_and_chat.models import ModelResponseSlot
@@ -128,15 +127,7 @@ from shared_configs.contextvars import get_current_tenant_id
logger = setup_logger()
ERROR_TYPE_CANCELLED = "cancelled"
class _AvailableFiles(BaseModel):
"""Separated file IDs for the FileReaderTool so it knows which loader to use."""
# IDs from the ``user_file`` table (project / persona-attached files).
user_file_ids: list[UUID] = []
# IDs from the ``file_record`` table (chat-attached files).
chat_file_ids: list[UUID] = []
APPROX_CHARS_PER_TOKEN = 4
def _collect_available_file_ids(
@@ -144,7 +135,7 @@ def _collect_available_file_ids(
project_id: int | None,
user_id: UUID | None,
db_session: Session,
) -> _AvailableFiles:
) -> AvailableFiles:
"""Collect all file IDs the FileReaderTool should be allowed to access.
Returns *separate* lists for chat-attached files (``file_record`` IDs) and
@@ -171,7 +162,7 @@ def _collect_available_file_ids(
for uf in user_files:
user_file_ids.add(uf.id)
return _AvailableFiles(
return AvailableFiles(
user_file_ids=list(user_file_ids),
chat_file_ids=list(chat_file_ids),
)
@@ -313,16 +304,27 @@ def extract_context_files(
if not user_files:
return _empty_extracted_context_files()
aggregate_tokens = sum(uf.token_count or 0 for uf in user_files)
# Aggregate tokens for the file content that will be added,
# skipping files that are injected as metadata only.
aggregate_tokens = sum(
uf.token_count or 0
for uf in user_files
if not mime_type_to_chat_file_type(uf.file_type).use_metadata_only()
)
max_actual_tokens = (
llm_max_context_window - reserved_token_count
) * max_llm_context_percentage
if aggregate_tokens >= max_actual_tokens:
tool_metadata = []
use_as_search_filter = not DISABLE_VECTOR_DB
if DISABLE_VECTOR_DB:
tool_metadata = _build_file_tool_metadata_for_user_files(user_files)
overflow_tool_metadata = [_build_tool_metadata(uf) for uf in user_files]
else:
overflow_tool_metadata = [
_build_tool_metadata(uf)
for uf in user_files
if mime_type_to_chat_file_type(uf.file_type).use_metadata_only()
]
return ExtractedContextFiles(
file_texts=[],
image_files=[],
@@ -330,11 +332,11 @@ def extract_context_files(
total_token_count=0,
file_metadata=[],
uncapped_token_count=aggregate_tokens,
file_metadata_for_tool=tool_metadata,
file_metadata_for_tool=overflow_tool_metadata,
)
# Files fit — load them into context
user_file_map = {str(uf.id): uf for uf in user_files}
user_file_map = {uf.file_id: uf for uf in user_files}
in_memory_files = load_in_memory_chat_files(
user_file_ids=[uf.id for uf in user_files],
db_session=db_session,
@@ -343,23 +345,38 @@ def extract_context_files(
file_texts: list[str] = []
image_files: list[ChatLoadedFile] = []
file_metadata: list[ContextFileMetadata] = []
tool_metadata: list[FileToolMetadata] = []
total_token_count = 0
for f in in_memory_files:
uf = user_file_map.get(str(f.file_id))
if f.file_type.is_text_file():
filename = f.filename or f"file_{f.file_id}"
if f.file_type.use_metadata_only():
# Metadata-only files are not injected as full text;
# only their metadata is provided, and the LLM reads content via tools.
if not uf:
logger.error(
f"File with id={f.file_id} in metadata-only path with no associated user file"
)
continue
tool_metadata.append(_build_tool_metadata(uf))
elif f.file_type.is_text_file():
text_content = _extract_text_from_in_memory_file(f)
if not text_content:
continue
if not uf:
logger.warning(f"No user file for file_id={f.file_id}")
continue
file_texts.append(text_content)
file_metadata.append(
ContextFileMetadata(
file_id=str(f.file_id),
filename=f.filename or f"file_{f.file_id}",
file_id=str(uf.id),
filename=filename,
file_content=text_content,
)
)
if uf and uf.token_count:
if uf.token_count:
total_token_count += uf.token_count
elif f.file_type == ChatFileType.IMAGE:
token_count = uf.token_count if uf and uf.token_count else 0
@@ -382,24 +399,22 @@ def extract_context_files(
total_token_count=total_token_count,
file_metadata=file_metadata,
uncapped_token_count=aggregate_tokens,
file_metadata_for_tool=tool_metadata,
)
APPROX_CHARS_PER_TOKEN = 4
def _build_tool_metadata(user_file: UserFile) -> FileToolMetadata:
"""Build lightweight FileToolMetadata from a UserFile record.
def _build_file_tool_metadata_for_user_files(
user_files: list[UserFile],
) -> list[FileToolMetadata]:
"""Build lightweight FileToolMetadata from a list of UserFile records."""
return [
FileToolMetadata(
file_id=str(uf.id),
filename=uf.name,
approx_char_count=(uf.token_count or 0) * APPROX_CHARS_PER_TOKEN,
)
for uf in user_files
]
Delegates to ``build_file_context`` so that the file ID exposed to the
LLM is always consistent with what FileReaderTool expects.
"""
return build_file_context(
tool_file_id=str(user_file.id),
filename=user_file.name,
file_type=mime_type_to_chat_file_type(user_file.file_type),
approx_char_count=(user_file.token_count or 0) * APPROX_CHARS_PER_TOKEN,
).tool_metadata
def determine_search_params(
@@ -467,41 +482,6 @@ def _resolve_query_processing_hook_result(
return hook_result.query.strip()
@dataclass(frozen=True)
class ChatTurnSetup:
"""Immutable context produced by ``build_chat_turn`` and consumed by ``_run_models``."""
new_msg_req: SendMessageRequest
chat_session: ChatSession
persona: Persona
user_message: ChatMessage
user_identity: LLMUserIdentity
llms: list[LLM] # length 1 for single-model, N for multi-model
model_display_names: list[str] # parallel to llms
simple_chat_history: list[ChatMessageSimple]
extracted_context_files: ExtractedContextFiles
reserved_messages: list[ChatMessage] # length 1 for single, N for multi
reserved_token_count: int
search_params: SearchParams
all_injected_file_metadata: dict[str, FileToolMetadata]
available_files: _AvailableFiles
tool_id_to_name_map: dict[int, str]
forced_tool_id: int | None
files: list[ChatLoadedFile]
chat_files_for_tools: list[ChatFile]
custom_agent_prompt: str | None
user_memory_context: UserMemoryContext
# For deep research: was the last assistant message a clarification request?
skip_clarification: bool
check_is_connected: Callable[[], bool]
cache: CacheBackend
# Execution params forwarded to per-model tool construction
bypass_acl: bool
slack_context: SlackContext | None
custom_tool_additional_headers: dict[str, str] | None
mcp_headers: dict[str, str] | None
def build_chat_turn(
new_msg_req: SendMessageRequest,
user: User,
@@ -595,41 +575,32 @@ def build_chat_turn(
# Check LLM cost limits before using the LLM (only for Onyx-managed keys),
# then build the LLM instance(s).
if is_multi:
assert llm_overrides is not None
llms: list[LLM] = []
model_display_names: list[str] = []
for override in llm_overrides:
llm = get_llm_for_persona(
persona=persona,
user=user,
llm_override=override,
additional_headers=litellm_additional_headers,
)
check_llm_cost_limit_for_provider(
db_session=db_session,
tenant_id=tenant_id,
llm_provider_api_key=llm.config.api_key,
)
llms.append(llm)
model_display_names.append(_build_model_display_name(override))
# Use first LLM for token counting — model-agnostic for setup purposes
token_counter = get_llm_token_counter(llms[0])
else:
primary_llm = get_llm_for_persona(
llms: list[LLM] = []
model_display_names: list[str] = []
selected_overrides: list[LLMOverride | None] = (
list(llm_overrides or [])
if is_multi
else [new_msg_req.llm_override or chat_session.llm_override]
)
for override in selected_overrides:
llm = get_llm_for_persona(
persona=persona,
user=user,
llm_override=new_msg_req.llm_override or chat_session.llm_override,
llm_override=override,
additional_headers=litellm_additional_headers,
)
check_llm_cost_limit_for_provider(
db_session=db_session,
tenant_id=tenant_id,
llm_provider_api_key=primary_llm.config.api_key,
llm_provider_api_key=llm.config.api_key,
)
llms = [primary_llm]
llms.append(llm)
model_display_names.append(_build_model_display_name(override))
token_counter = get_llm_token_counter(llms[0])
# Maintain parity with previous code (reason unclear): single-model runs use an empty display name.
if not is_multi:
model_display_names = [""]
token_counter = get_llm_token_counter(primary_llm)
# Verify that the user-specified files actually belong to the user
verify_user_files(
@@ -790,12 +761,8 @@ def build_chat_turn(
db_session=db_session,
)
# For multi-model: use the smallest context window across all models for safety
llm_max_context_window = (
min(llm.config.max_input_tokens for llm in llms)
if is_multi
else llms[0].config.max_input_tokens
)
# Use the smallest context window across models for safety (harmless for N=1).
llm_max_context_window = min(llm.config.max_input_tokens for llm in llms)
extracted_context_files = extract_context_files(
user_files=context_user_files,
@@ -1024,15 +991,9 @@ def _run_models(
model_errored: list[bool] = [False] * n_models
# Set when the drain loop exits early (HTTP disconnect / GeneratorExit).
# Signals emitters to skip future puts and workers to self-complete.
# Signals emitters to skip future puts so workers exit promptly.
drain_done = threading.Event()
# One lock per model. acquire(blocking=False) claims completion responsibility.
# Whichever caller (worker self-complete or disconnect handler) acquires first
# is the one that calls llm_loop_completion_handle for that model.
# Locks are intentionally never released — they serve as permanent "claimed" markers.
completion_locks: list[threading.Lock] = [threading.Lock() for _ in range(n_models)]
def _run_model(model_idx: int) -> None:
"""Run one LLM loop inside a worker thread, writing packets to ``merged_queue``."""
model_emitter = Emitter(
@@ -1141,36 +1102,6 @@ def _run_models(
finally:
merged_queue.put((model_idx, _MODEL_DONE))
# Self-completion on disconnect: if the drain loop exited early (drain_done is set)
# AND we succeeded AND we can claim the completion lock, persist our response.
# The lock prevents a double-completion race with the disconnect handler in the
# _run_models finally block, which runs concurrently on the main thread.
if (
drain_done.is_set()
and model_succeeded[model_idx]
and completion_locks[model_idx].acquire(blocking=False)
):
try:
with get_session_with_current_tenant() as self_complete_db:
assistant_message = self_complete_db.get(
ChatMessage, setup.reserved_messages[model_idx].id
)
if assistant_message is not None:
llm_loop_completion_handle(
state_container=state_containers[model_idx],
is_connected=lambda: True,
db_session=self_complete_db,
assistant_message=assistant_message,
llm=setup.llms[model_idx],
reserved_tokens=setup.reserved_token_count,
)
except Exception:
logger.exception(
"model %d (%s): self-completion after disconnect failed",
model_idx,
setup.model_display_names[model_idx],
)
def _delete_orphaned_message(model_idx: int, context: str) -> None:
"""Delete a reserved ChatMessage that was never populated due to a model error."""
try:
@@ -1307,17 +1238,19 @@ def _run_models(
else:
# Early exit (GeneratorExit from raw HTTP disconnect, or unhandled
# exception in the drain loop).
# 1. Signal emitters to stop blocking — future emit() calls return immediately.
# 1. Signal emitters to stop — future emit() calls return immediately,
# so workers exit their LLM loops promptly.
drain_done.set()
# 2. Complete models that already finished before the disconnect fired (B1 fix).
# These workers exited _run_model before drain_done was set and won't
# self-complete. Use completion_locks to prevent double-completion with any
# worker that finishes and self-completes concurrently.
# 2. Wait for all workers to finish. Once drain_done is set the Emitter
# short-circuits, so workers should exit quickly.
executor.shutdown(wait=True)
# 3. All workers are done — complete from the main thread only.
for i in range(n_models):
if model_succeeded[i] and completion_locks[i].acquire(blocking=False):
if model_succeeded[i]:
try:
llm_loop_completion_handle(
state_container=state_containers[i],
# Model already finished — persist full response.
is_connected=lambda: True,
db_session=db_session,
assistant_message=setup.reserved_messages[i],
@@ -1331,17 +1264,13 @@ def _run_models(
setup.model_display_names[i],
)
elif model_errored[i]:
# Model errored before the disconnect — delete its orphaned message.
_delete_orphaned_message(i, "disconnect")
# 3. Drain buffered packets from memory — no consumer is running.
# 4. Drain buffered packets from memory — no consumer is running.
while not merged_queue.empty():
try:
merged_queue.get_nowait()
except queue.Empty:
break
# 4. Don't block the server thread — still-running workers self-complete
# via drain_done, or self-clean on error.
executor.shutdown(wait=False)
def _stream_chat_turn(
@@ -1539,15 +1468,11 @@ def handle_stream_message_objects(
)
def _build_model_display_name(override: LLMOverride) -> str:
def _build_model_display_name(override: LLMOverride | None) -> str:
"""Build a human-readable display name from an LLM override."""
if override.display_name:
return override.display_name
if override.model_version:
return override.model_version
if override.model_provider:
return override.model_provider
return "unknown"
if override is None:
return "unknown"
return override.display_name or override.model_version or "unknown"
def handle_multi_model_stream(

View File

@@ -286,11 +286,9 @@ USING_AWS_MANAGED_OPENSEARCH = (
os.environ.get("USING_AWS_MANAGED_OPENSEARCH", "").lower() == "true"
)
# Profiling adds some overhead to OpenSearch operations. This overhead is
# unknown right now. It is enabled by default so we can get useful logs for
# investigating slow queries. We may never disable it if the overhead is
# minimal.
# unknown right now. Profiling is disabled by default (OPENSEARCH_PROFILING_DISABLED defaults to true).
OPENSEARCH_PROFILING_DISABLED = (
os.environ.get("OPENSEARCH_PROFILING_DISABLED", "").lower() == "true"
os.environ.get("OPENSEARCH_PROFILING_DISABLED", "true").lower() == "true"
)
# Whether to disable match highlights for OpenSearch. Defaults to True for now
# as we investigate query performance.
@@ -942,9 +940,20 @@ CUSTOM_ANSWER_VALIDITY_CONDITIONS = json.loads(
)
VESPA_REQUEST_TIMEOUT = int(os.environ.get("VESPA_REQUEST_TIMEOUT") or "15")
# This is the timeout for the client side of the Vespa migration task. When
# exceeded, an exception is raised in our code. This value should be higher than
# VESPA_MIGRATION_SERVER_SIDE_REQUEST_TIMEOUT.
VESPA_MIGRATION_REQUEST_TIMEOUT_S = int(
os.environ.get("VESPA_MIGRATION_REQUEST_TIMEOUT_S") or "120"
)
# This is the timeout Vespa uses on the server side to know when to wrap up its
# traversal and try to report partial results. This differs from the client
# timeout above which raises an exception in our code when exceeded. This
# timeout allows Vespa to return gracefully. This value should be lower than
# VESPA_MIGRATION_REQUEST_TIMEOUT_S. Formatted as <number of seconds>s.
VESPA_MIGRATION_SERVER_SIDE_REQUEST_TIMEOUT = os.environ.get(
"VESPA_MIGRATION_SERVER_SIDE_REQUEST_TIMEOUT", "110s"
)
SYSTEM_RECURSION_LIMIT = int(os.environ.get("SYSTEM_RECURSION_LIMIT") or "1000")
@@ -1079,7 +1088,6 @@ POD_NAMESPACE = os.environ.get("POD_NAMESPACE")
DEV_MODE = os.environ.get("DEV_MODE", "").lower() == "true"
HOOK_ENABLED = os.environ.get("HOOK_ENABLED", "").lower() == "true"
INTEGRATION_TESTS_MODE = os.environ.get("INTEGRATION_TESTS_MODE", "").lower() == "true"

View File

@@ -212,6 +212,7 @@ class DocumentSource(str, Enum):
PRODUCTBOARD = "productboard"
FILE = "file"
CODA = "coda"
CANVAS = "canvas"
NOTION = "notion"
ZULIP = "zulip"
LINEAR = "linear"
@@ -672,6 +673,7 @@ DocumentSourceDescription: dict[DocumentSource, str] = {
DocumentSource.SLAB: "slab data",
DocumentSource.PRODUCTBOARD: "productboard data (boards, etc.)",
DocumentSource.FILE: "files",
DocumentSource.CANVAS: "canvas lms - courses, pages, assignments, and announcements",
DocumentSource.CODA: "coda - team workspace with docs, tables, and pages",
DocumentSource.NOTION: "notion data - a workspace that combines note-taking, \
project management, and collaboration tools into a single, customizable platform",

View File

@@ -0,0 +1,32 @@
"""
Permissioning / AccessControl logic for Canvas courses.
CE stub — returns None (no permissions). The EE implementation is loaded
at runtime via ``fetch_versioned_implementation``.
"""
from collections.abc import Callable
from typing import cast
from onyx.access.models import ExternalAccess
from onyx.connectors.canvas.client import CanvasApiClient
from onyx.utils.variable_functionality import fetch_versioned_implementation
from onyx.utils.variable_functionality import global_version
def get_course_permissions(
canvas_client: CanvasApiClient,
course_id: int,
) -> ExternalAccess | None:
if not global_version.is_ee_version():
return None
ee_get_course_permissions = cast(
Callable[[CanvasApiClient, int], ExternalAccess | None],
fetch_versioned_implementation(
"onyx.external_permissions.canvas.access",
"get_course_permissions",
),
)
return ee_get_course_permissions(canvas_client, course_id)

View File

@@ -2,6 +2,7 @@ from __future__ import annotations
import logging
import re
from collections.abc import Iterator
from typing import Any
from urllib.parse import urlparse
@@ -190,3 +191,22 @@ class CanvasApiClient:
if clean_endpoint:
final_url += "/" + clean_endpoint
return final_url
def paginate(
self,
endpoint: str,
params: dict[str, Any] | None = None,
) -> Iterator[list[Any]]:
"""Yield each page of results, following Link-header pagination.
Makes the first request with endpoint + params, then follows
next_url from Link headers for subsequent pages.
"""
response, next_url = self.get(endpoint, params=params)
while True:
if not response:
break
yield response
if not next_url:
break
response, next_url = self.get(full_url=next_url)
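A hedged usage sketch for ``paginate`` (the base URL and token are placeholders; the constructor kwargs match the connector's ``load_credentials`` later in this diff):

client = CanvasApiClient(
    bearer_token="example-token",
    canvas_base_url="https://canvas.example.edu",
)
all_courses: list[dict] = []
for page in client.paginate("courses", params={"per_page": "100"}):
    # Each yielded item is one page: a list of decoded JSON objects.
    all_courses.extend(page)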

View File

@@ -1,17 +1,82 @@
from datetime import datetime
from datetime import timezone
from typing import Any
from typing import cast
from typing import Literal
from typing import NoReturn
from typing import TypeAlias
from pydantic import BaseModel
from retry import retry
from typing_extensions import override
from onyx.access.models import ExternalAccess
from onyx.configs.app_configs import INDEX_BATCH_SIZE
from onyx.configs.constants import DocumentSource
from onyx.connectors.canvas.access import get_course_permissions
from onyx.connectors.canvas.client import CanvasApiClient
from onyx.connectors.exceptions import ConnectorValidationError
from onyx.connectors.exceptions import CredentialExpiredError
from onyx.connectors.exceptions import InsufficientPermissionsError
from onyx.connectors.exceptions import UnexpectedValidationError
from onyx.connectors.interfaces import CheckpointedConnectorWithPermSync
from onyx.connectors.interfaces import CheckpointOutput
from onyx.connectors.interfaces import GenerateSlimDocumentOutput
from onyx.connectors.interfaces import SecondsSinceUnixEpoch
from onyx.connectors.interfaces import SlimConnectorWithPermSync
from onyx.connectors.models import ConnectorCheckpoint
from onyx.connectors.models import ConnectorMissingCredentialError
from onyx.connectors.models import Document
from onyx.connectors.models import ImageSection
from onyx.connectors.models import TextSection
from onyx.error_handling.exceptions import OnyxError
from onyx.file_processing.html_utils import parse_html_page_basic
from onyx.indexing.indexing_heartbeat import IndexingHeartbeatInterface
from onyx.utils.logger import setup_logger
logger = setup_logger()
def _handle_canvas_api_error(e: OnyxError) -> NoReturn:
"""Map Canvas API errors to connector framework exceptions."""
if e.status_code == 401:
raise CredentialExpiredError(
"Canvas API token is invalid or expired (HTTP 401)."
)
elif e.status_code == 403:
raise InsufficientPermissionsError(
"Canvas API token does not have sufficient permissions (HTTP 403)."
)
elif e.status_code == 429:
raise ConnectorValidationError(
"Canvas rate-limit exceeded (HTTP 429). Please try again later."
)
elif e.status_code >= 500:
raise UnexpectedValidationError(
f"Unexpected Canvas HTTP error (status={e.status_code}): {e}"
)
else:
raise ConnectorValidationError(
f"Canvas API error (status={e.status_code}): {e}"
)
class CanvasCourse(BaseModel):
id: int
name: str
course_code: str
created_at: str
workflow_state: str
name: str | None = None
course_code: str | None = None
created_at: str | None = None
workflow_state: str | None = None
@classmethod
def from_api(cls, payload: dict[str, Any]) -> "CanvasCourse":
return cls(
id=payload["id"],
name=payload.get("name"),
course_code=payload.get("course_code"),
created_at=payload.get("created_at"),
workflow_state=payload.get("workflow_state"),
)
class CanvasPage(BaseModel):
@@ -19,10 +84,22 @@ class CanvasPage(BaseModel):
url: str
title: str
body: str | None = None
created_at: str
updated_at: str
created_at: str | None = None
updated_at: str | None = None
course_id: int
@classmethod
def from_api(cls, payload: dict[str, Any], course_id: int) -> "CanvasPage":
return cls(
page_id=payload["page_id"],
url=payload["url"],
title=payload["title"],
body=payload.get("body"),
created_at=payload.get("created_at"),
updated_at=payload.get("updated_at"),
course_id=course_id,
)
class CanvasAssignment(BaseModel):
id: int
@@ -30,10 +107,23 @@ class CanvasAssignment(BaseModel):
description: str | None = None
html_url: str
course_id: int
created_at: str
updated_at: str
created_at: str | None = None
updated_at: str | None = None
due_at: str | None = None
@classmethod
def from_api(cls, payload: dict[str, Any], course_id: int) -> "CanvasAssignment":
return cls(
id=payload["id"],
name=payload["name"],
description=payload.get("description"),
html_url=payload["html_url"],
course_id=course_id,
created_at=payload.get("created_at"),
updated_at=payload.get("updated_at"),
due_at=payload.get("due_at"),
)
class CanvasAnnouncement(BaseModel):
id: int
@@ -43,6 +133,17 @@ class CanvasAnnouncement(BaseModel):
posted_at: str | None = None
course_id: int
@classmethod
def from_api(cls, payload: dict[str, Any], course_id: int) -> "CanvasAnnouncement":
return cls(
id=payload["id"],
title=payload["title"],
message=payload.get("message"),
html_url=payload["html_url"],
posted_at=payload.get("posted_at"),
course_id=course_id,
)
CanvasStage: TypeAlias = Literal["pages", "assignments", "announcements"]
@@ -72,3 +173,286 @@ class CanvasConnectorCheckpoint(ConnectorCheckpoint):
self.current_course_index += 1
self.stage = "pages"
self.next_url = None
class CanvasConnector(
CheckpointedConnectorWithPermSync[CanvasConnectorCheckpoint],
SlimConnectorWithPermSync,
):
def __init__(
self,
canvas_base_url: str,
batch_size: int = INDEX_BATCH_SIZE,
) -> None:
self.canvas_base_url = canvas_base_url.rstrip("/").removesuffix("/api/v1")
self.batch_size = batch_size
self._canvas_client: CanvasApiClient | None = None
self._course_permissions_cache: dict[int, ExternalAccess | None] = {}
@property
def canvas_client(self) -> CanvasApiClient:
if self._canvas_client is None:
raise ConnectorMissingCredentialError("Canvas")
return self._canvas_client
def _get_course_permissions(self, course_id: int) -> ExternalAccess | None:
"""Get course permissions with caching."""
if course_id not in self._course_permissions_cache:
self._course_permissions_cache[course_id] = get_course_permissions(
canvas_client=self.canvas_client,
course_id=course_id,
)
return self._course_permissions_cache[course_id]
@retry(tries=3, delay=1, backoff=2)
def _list_courses(self) -> list[CanvasCourse]:
"""Fetch all courses accessible to the authenticated user."""
logger.debug("Fetching Canvas courses")
courses: list[CanvasCourse] = []
for page in self.canvas_client.paginate(
"courses", params={"per_page": "100", "state[]": "available"}
):
courses.extend(CanvasCourse.from_api(c) for c in page)
return courses
@retry(tries=3, delay=1, backoff=2)
def _list_pages(self, course_id: int) -> list[CanvasPage]:
"""Fetch all pages for a given course."""
logger.debug(f"Fetching pages for course {course_id}")
pages: list[CanvasPage] = []
for page in self.canvas_client.paginate(
f"courses/{course_id}/pages",
params={"per_page": "100", "include[]": "body", "published": "true"},
):
pages.extend(CanvasPage.from_api(p, course_id=course_id) for p in page)
return pages
@retry(tries=3, delay=1, backoff=2)
def _list_assignments(self, course_id: int) -> list[CanvasAssignment]:
"""Fetch all assignments for a given course."""
logger.debug(f"Fetching assignments for course {course_id}")
assignments: list[CanvasAssignment] = []
for page in self.canvas_client.paginate(
f"courses/{course_id}/assignments",
params={"per_page": "100", "published": "true"},
):
assignments.extend(
CanvasAssignment.from_api(a, course_id=course_id) for a in page
)
return assignments
@retry(tries=3, delay=1, backoff=2)
def _list_announcements(self, course_id: int) -> list[CanvasAnnouncement]:
"""Fetch all announcements for a given course."""
logger.debug(f"Fetching announcements for course {course_id}")
announcements: list[CanvasAnnouncement] = []
for page in self.canvas_client.paginate(
"announcements",
params={
"per_page": "100",
"context_codes[]": f"course_{course_id}",
"active_only": "true",
},
):
announcements.extend(
CanvasAnnouncement.from_api(a, course_id=course_id) for a in page
)
return announcements
def _build_document(
self,
doc_id: str,
link: str,
text: str,
semantic_identifier: str,
doc_updated_at: datetime | None,
course_id: int,
doc_type: str,
) -> Document:
"""Build a Document with standard Canvas fields."""
return Document(
id=doc_id,
sections=cast(
list[TextSection | ImageSection],
[TextSection(link=link, text=text)],
),
source=DocumentSource.CANVAS,
semantic_identifier=semantic_identifier,
doc_updated_at=doc_updated_at,
metadata={"course_id": str(course_id), "type": doc_type},
)
def _convert_page_to_document(self, page: CanvasPage) -> Document:
"""Convert a Canvas page to a Document."""
link = f"{self.canvas_base_url}/courses/{page.course_id}/pages/{page.url}"
text_parts = [page.title]
body_text = parse_html_page_basic(page.body) if page.body else ""
if body_text:
text_parts.append(body_text)
doc_updated_at = (
datetime.fromisoformat(page.updated_at.replace("Z", "+00:00")).astimezone(
timezone.utc
)
if page.updated_at
else None
)
document = self._build_document(
doc_id=f"canvas-page-{page.course_id}-{page.page_id}",
link=link,
text="\n\n".join(text_parts),
semantic_identifier=page.title or f"Page {page.page_id}",
doc_updated_at=doc_updated_at,
course_id=page.course_id,
doc_type="page",
)
return document
def _convert_assignment_to_document(self, assignment: CanvasAssignment) -> Document:
"""Convert a Canvas assignment to a Document."""
text_parts = [assignment.name]
desc_text = (
parse_html_page_basic(assignment.description)
if assignment.description
else ""
)
if desc_text:
text_parts.append(desc_text)
if assignment.due_at:
due_dt = datetime.fromisoformat(
assignment.due_at.replace("Z", "+00:00")
).astimezone(timezone.utc)
text_parts.append(f"Due: {due_dt.strftime('%B %d, %Y %H:%M UTC')}")
doc_updated_at = (
datetime.fromisoformat(
assignment.updated_at.replace("Z", "+00:00")
).astimezone(timezone.utc)
if assignment.updated_at
else None
)
document = self._build_document(
doc_id=f"canvas-assignment-{assignment.course_id}-{assignment.id}",
link=assignment.html_url,
text="\n\n".join(text_parts),
semantic_identifier=assignment.name or f"Assignment {assignment.id}",
doc_updated_at=doc_updated_at,
course_id=assignment.course_id,
doc_type="assignment",
)
return document
def _convert_announcement_to_document(
self, announcement: CanvasAnnouncement
) -> Document:
"""Convert a Canvas announcement to a Document."""
text_parts = [announcement.title]
msg_text = (
parse_html_page_basic(announcement.message) if announcement.message else ""
)
if msg_text:
text_parts.append(msg_text)
doc_updated_at = (
datetime.fromisoformat(
announcement.posted_at.replace("Z", "+00:00")
).astimezone(timezone.utc)
if announcement.posted_at
else None
)
document = self._build_document(
doc_id=f"canvas-announcement-{announcement.course_id}-{announcement.id}",
link=announcement.html_url,
text="\n\n".join(text_parts),
semantic_identifier=announcement.title or f"Announcement {announcement.id}",
doc_updated_at=doc_updated_at,
course_id=announcement.course_id,
doc_type="announcement",
)
return document
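The ``fromisoformat`` plus ``replace("Z", "+00:00")`` pattern above repeats for pages, assignments, and announcements; a hypothetical shared helper (not part of this diff) could consolidate it:

def _parse_canvas_timestamp(value: str | None) -> datetime | None:
    """Parse a Canvas ISO-8601 timestamp (trailing 'Z') into an aware UTC datetime."""
    if not value:
        return None
    return datetime.fromisoformat(value.replace("Z", "+00:00")).astimezone(timezone.utc)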
@override
def load_credentials(self, credentials: dict[str, Any]) -> dict[str, Any] | None:
"""Load and validate Canvas credentials."""
access_token = credentials.get("canvas_access_token")
if not access_token:
raise ConnectorMissingCredentialError("Canvas")
try:
client = CanvasApiClient(
bearer_token=access_token,
canvas_base_url=self.canvas_base_url,
)
client.get("courses", params={"per_page": "1"})
except ValueError as e:
raise ConnectorValidationError(f"Invalid Canvas base URL: {e}")
except OnyxError as e:
_handle_canvas_api_error(e)
self._canvas_client = client
return None
@override
def validate_connector_settings(self) -> None:
"""Validate Canvas connector settings by testing API access."""
try:
self.canvas_client.get("courses", params={"per_page": "1"})
logger.info("Canvas connector settings validated successfully")
except OnyxError as e:
_handle_canvas_api_error(e)
except ConnectorMissingCredentialError:
raise
except Exception as exc:
raise UnexpectedValidationError(
f"Unexpected error during Canvas settings validation: {exc}"
)
@override
def load_from_checkpoint(
self,
start: SecondsSinceUnixEpoch,
end: SecondsSinceUnixEpoch,
checkpoint: CanvasConnectorCheckpoint,
) -> CheckpointOutput[CanvasConnectorCheckpoint]:
# TODO(benwu408): implemented in PR3 (checkpoint)
raise NotImplementedError
@override
def load_from_checkpoint_with_perm_sync(
self,
start: SecondsSinceUnixEpoch,
end: SecondsSinceUnixEpoch,
checkpoint: CanvasConnectorCheckpoint,
) -> CheckpointOutput[CanvasConnectorCheckpoint]:
# TODO(benwu408): implemented in PR3 (checkpoint)
raise NotImplementedError
@override
def build_dummy_checkpoint(self) -> CanvasConnectorCheckpoint:
# TODO(benwu408): implemented in PR3 (checkpoint)
raise NotImplementedError
@override
def validate_checkpoint_json(
self, checkpoint_json: str
) -> CanvasConnectorCheckpoint:
# TODO(benwu408): implemented in PR3 (checkpoint)
raise NotImplementedError
@override
def retrieve_all_slim_docs_perm_sync(
self,
start: SecondsSinceUnixEpoch | None = None,
end: SecondsSinceUnixEpoch | None = None,
callback: IndexingHeartbeatInterface | None = None,
) -> GenerateSlimDocumentOutput:
# TODO(benwu408): implemented in PR4 (perm sync)
raise NotImplementedError

View File

@@ -11,11 +11,13 @@ from discord import Client
from discord.channel import TextChannel
from discord.channel import Thread
from discord.enums import MessageType
from discord.errors import LoginFailure
from discord.flags import Intents
from discord.message import Message as DiscordMessage
from onyx.configs.app_configs import INDEX_BATCH_SIZE
from onyx.configs.constants import DocumentSource
from onyx.connectors.exceptions import CredentialInvalidError
from onyx.connectors.interfaces import GenerateDocumentsOutput
from onyx.connectors.interfaces import LoadConnector
from onyx.connectors.interfaces import PollConnector
@@ -209,8 +211,19 @@ def _manage_async_retrieval(
intents = Intents.default()
intents.message_content = True
async with Client(intents=intents) as discord_client:
asyncio.create_task(discord_client.start(token))
await discord_client.wait_until_ready()
start_task = asyncio.create_task(discord_client.start(token))
ready_task = asyncio.create_task(discord_client.wait_until_ready())
done, _ = await asyncio.wait(
{start_task, ready_task},
return_when=asyncio.FIRST_COMPLETED,
)
# start() runs indefinitely once connected, so it only lands
# in `done` when login/connection failed — propagate the error.
if start_task in done:
ready_task.cancel()
start_task.result()
filtered_channels: list[TextChannel] = await _fetch_filtered_channels(
discord_client=discord_client,
@@ -276,6 +289,19 @@ class DiscordConnector(PollConnector, LoadConnector):
self._discord_bot_token = credentials["discord_bot_token"]
return None
def validate_connector_settings(self) -> None:
loop = asyncio.new_event_loop()
try:
client = Client(intents=Intents.default())
try:
loop.run_until_complete(client.login(self.discord_bot_token))
except LoginFailure as e:
raise CredentialInvalidError(f"Invalid Discord bot token: {e}")
finally:
loop.run_until_complete(client.close())
finally:
loop.close()
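The start/ready race in ``_manage_async_retrieval`` generalizes beyond Discord; a self-contained sketch of the pattern, with stand-ins for the two coroutines:

import asyncio

async def long_running() -> None:
    # Stands in for discord_client.start(): normally runs until cancelled,
    # here it fails immediately to show error propagation.
    raise RuntimeError("login failed")

async def main() -> None:
    runner = asyncio.create_task(long_running())
    ready = asyncio.create_task(asyncio.sleep(5))  # stands in for wait_until_ready()
    done, _ = await asyncio.wait({runner, ready}, return_when=asyncio.FIRST_COMPLETED)
    if runner in done:
        ready.cancel()
        runner.result()  # re-raises the startup error

asyncio.run(main())  # raises RuntimeError: login failed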
def _manage_doc_batching(
self,
start: datetime | None = None,

View File

@@ -8,7 +8,6 @@ from collections.abc import Generator
from collections.abc import Iterator
from datetime import datetime
from enum import Enum
from functools import partial
from typing import Any
from typing import cast
from typing import Protocol
@@ -1487,134 +1486,113 @@ class GoogleDriveConnector(
end=end,
)
def _extract_docs_from_google_drive(
def _convert_retrieved_files_to_documents(
self,
drive_files_iter: Iterator[RetrievedDriveFile],
checkpoint: GoogleDriveCheckpoint,
start: SecondsSinceUnixEpoch | None,
end: SecondsSinceUnixEpoch | None,
include_permissions: bool,
) -> Iterator[Document | ConnectorFailure | HierarchyNode]:
"""
Retrieves and converts Google Drive files to documents.
Also yields HierarchyNode objects for ancestor folders.
Converts retrieved files to documents, yielding HierarchyNode
objects for ancestor folders before the converted documents.
"""
field_type = (
DriveFileFieldType.WITH_PERMISSIONS
if include_permissions or self.exclude_domain_link_only
else DriveFileFieldType.STANDARD
permission_sync_context = (
PermissionSyncContext(
primary_admin_email=self.primary_admin_email,
google_domain=self.google_domain,
)
if include_permissions
else None
)
try:
# Build permission sync context if needed
permission_sync_context = (
PermissionSyncContext(
primary_admin_email=self.primary_admin_email,
google_domain=self.google_domain,
)
if include_permissions
else None
files_batch: list[RetrievedDriveFile] = []
for retrieved_file in drive_files_iter:
if self.exclude_domain_link_only and has_link_only_permission(
retrieved_file.drive_file
):
continue
if retrieved_file.error is None:
files_batch.append(retrieved_file)
continue
failure_stage = retrieved_file.completion_stage.value
failure_message = f"retrieval failure during stage: {failure_stage},"
failure_message += f"user: {retrieved_file.user_email},"
failure_message += f"parent drive/folder: {retrieved_file.parent_id},"
failure_message += f"error: {retrieved_file.error}"
logger.error(failure_message)
yield ConnectorFailure(
failed_entity=EntityFailure(
entity_id=retrieved_file.drive_file.get("id", failure_stage),
),
failure_message=failure_message,
exception=retrieved_file.error,
)
# Prepare a partial function with the credentials and admin email
convert_func = partial(
convert_drive_item_to_document,
new_ancestors = self._get_new_ancestors_for_files(
files=files_batch,
seen_hierarchy_node_raw_ids=checkpoint.seen_hierarchy_node_raw_ids,
fully_walked_hierarchy_node_raw_ids=checkpoint.fully_walked_hierarchy_node_raw_ids,
permission_sync_context=permission_sync_context,
add_prefix=True,
)
if new_ancestors:
logger.debug(f"Yielding {len(new_ancestors)} new hierarchy nodes")
yield from new_ancestors
func_with_args = [
(
self._convert_retrieved_file_to_document,
(retrieved_file, permission_sync_context),
)
for retrieved_file in files_batch
]
raw_results = cast(
list[Document | ConnectorFailure | None],
run_functions_tuples_in_parallel(func_with_args, max_workers=8),
)
results: list[Document | ConnectorFailure] = [
r for r in raw_results if r is not None
]
logger.debug(f"batch has {len(results)} docs or failures")
yield from results
checkpoint.retrieved_folder_and_drive_ids = self._retrieved_folder_and_drive_ids
def _convert_retrieved_file_to_document(
self,
retrieved_file: RetrievedDriveFile,
permission_sync_context: PermissionSyncContext | None,
) -> Document | ConnectorFailure | None:
"""
Converts a single retrieved file to a document.
"""
try:
return convert_drive_item_to_document(
self.creds,
self.allow_images,
self.size_threshold,
permission_sync_context,
[retrieved_file.user_email, self.primary_admin_email]
+ get_file_owners(retrieved_file.drive_file, self.primary_admin_email),
retrieved_file.drive_file,
)
# Fetch files in batches
batches_complete = 0
files_batch: list[RetrievedDriveFile] = []
def _yield_batch(
files_batch: list[RetrievedDriveFile],
) -> Iterator[Document | ConnectorFailure | HierarchyNode]:
nonlocal batches_complete
# First, yield any new ancestor hierarchy nodes
new_ancestors = self._get_new_ancestors_for_files(
files=files_batch,
seen_hierarchy_node_raw_ids=checkpoint.seen_hierarchy_node_raw_ids,
fully_walked_hierarchy_node_raw_ids=checkpoint.fully_walked_hierarchy_node_raw_ids,
permission_sync_context=permission_sync_context,
add_prefix=True, # Indexing path - prefix here
)
if new_ancestors:
logger.debug(
f"Yielding {len(new_ancestors)} new hierarchy nodes for batch {batches_complete}"
)
yield from new_ancestors
# Process the batch using run_functions_tuples_in_parallel
func_with_args = [
(
convert_func,
(
[file.user_email, self.primary_admin_email]
+ get_file_owners(
file.drive_file, self.primary_admin_email
),
file.drive_file,
),
)
for file in files_batch
]
results = cast(
list[Document | ConnectorFailure | None],
run_functions_tuples_in_parallel(func_with_args, max_workers=8),
)
logger.debug(
f"finished processing batch {batches_complete} with {len(results)} results"
)
docs_and_failures = [result for result in results if result is not None]
logger.debug(
f"batch {batches_complete} has {len(docs_and_failures)} docs or failures"
)
if docs_and_failures:
yield from docs_and_failures
batches_complete += 1
logger.debug(f"finished yielding batch {batches_complete}")
for retrieved_file in self._fetch_drive_items(
field_type=field_type,
checkpoint=checkpoint,
start=start,
end=end,
):
if self.exclude_domain_link_only and has_link_only_permission(
retrieved_file.drive_file
):
continue
if retrieved_file.error is None:
files_batch.append(retrieved_file)
continue
# handle retrieval errors
failure_stage = retrieved_file.completion_stage.value
failure_message = f"retrieval failure during stage: {failure_stage},"
failure_message += f"user: {retrieved_file.user_email},"
failure_message += f"parent drive/folder: {retrieved_file.parent_id},"
failure_message += f"error: {retrieved_file.error}"
logger.error(failure_message)
yield ConnectorFailure(
failed_entity=EntityFailure(
entity_id=failure_stage,
),
failure_message=failure_message,
exception=retrieved_file.error,
)
yield from _yield_batch(files_batch)
checkpoint.retrieved_folder_and_drive_ids = (
self._retrieved_folder_and_drive_ids
)
except Exception as e:
logger.exception(f"Error extracting documents from Google Drive: {e}")
raise e
logger.exception(
f"Error extracting document: "
f"{retrieved_file.drive_file.get('name')} from Google Drive"
)
return ConnectorFailure(
failed_entity=EntityFailure(
entity_id=retrieved_file.drive_file.get("id", "unknown"),
),
failure_message=(
f"Error extracting document: "
f"{retrieved_file.drive_file.get('name')}"
),
exception=e,
)
def _load_from_checkpoint(
self,
@@ -1638,8 +1616,19 @@ class GoogleDriveConnector(
checkpoint = copy.deepcopy(checkpoint)
self._retrieved_folder_and_drive_ids = checkpoint.retrieved_folder_and_drive_ids
try:
yield from self._extract_docs_from_google_drive(
checkpoint, start, end, include_permissions
field_type = (
DriveFileFieldType.WITH_PERMISSIONS
if include_permissions or self.exclude_domain_link_only
else DriveFileFieldType.STANDARD
)
drive_files_iter = self._fetch_drive_items(
field_type=field_type,
checkpoint=checkpoint,
start=start,
end=end,
)
yield from self._convert_retrieved_files_to_documents(
drive_files_iter, checkpoint, include_permissions
)
except Exception as e:
if MISSING_SCOPES_ERROR_STR in str(e):

View File

@@ -4,6 +4,8 @@ from datetime import datetime
from datetime import timezone
from enum import Enum
from typing import cast
from urllib.parse import parse_qs
from urllib.parse import urlparse
from googleapiclient.discovery import Resource # type: ignore
from googleapiclient.errors import HttpError # type: ignore
@@ -496,3 +498,41 @@ def get_root_folder_id(service: Resource) -> str:
.get(fileId="root", fields=GoogleFields.ID.value)
.execute()[GoogleFields.ID.value]
)
def _extract_file_id_from_web_view_link(web_view_link: str) -> str:
parsed = urlparse(web_view_link)
path_parts = [part for part in parsed.path.split("/") if part]
if "d" in path_parts:
idx = path_parts.index("d")
if idx + 1 < len(path_parts):
return path_parts[idx + 1]
query_params = parse_qs(parsed.query)
for key in ("id", "fileId"):
value = query_params.get(key)
if value and value[0]:
return value[0]
raise ValueError(
f"Unable to extract Drive file id from webViewLink: {web_view_link}"
)
def get_file_by_web_view_link(
service: GoogleDriveService,
web_view_link: str,
fields: str,
) -> GoogleDriveFileType:
"""Retrieve a Google Drive file using its webViewLink."""
file_id = _extract_file_id_from_web_view_link(web_view_link)
return (
service.files()
.get(
fileId=file_id,
supportsAllDrives=True,
fields=fields,
)
.execute()
)
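Illustrative inputs for ``_extract_file_id_from_web_view_link`` (made-up URLs covering the two shapes it handles):

_extract_file_id_from_web_view_link(
    "https://docs.google.com/document/d/abc123/edit"
)  # -> "abc123" (path style: .../d/<id>/...)

_extract_file_id_from_web_view_link(
    "https://drive.google.com/open?id=abc123"
)  # -> "abc123" (query style: ?id=<id>)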

View File

@@ -72,6 +72,10 @@ CONNECTOR_CLASS_MAP = {
module_path="onyx.connectors.coda.connector",
class_name="CodaConnector",
),
DocumentSource.CANVAS: ConnectorMapping(
module_path="onyx.connectors.canvas.connector",
class_name="CanvasConnector",
),
DocumentSource.NOTION: ConnectorMapping(
module_path="onyx.connectors.notion.connector",
class_name="NotionConnector",

View File

@@ -4,7 +4,6 @@ from fastapi_users.password import PasswordHelper
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import joinedload
from sqlalchemy.orm import selectinload
from sqlalchemy.orm import Session
from onyx.auth.api_key import ApiKeyDescriptor
@@ -55,7 +54,6 @@ async def fetch_user_for_api_key(
select(User)
.join(ApiKey, ApiKey.user_id == User.id)
.where(ApiKey.hashed_api_key == hashed_api_key)
.options(selectinload(User.memories))
)

View File

@@ -13,7 +13,6 @@ from sqlalchemy import func
from sqlalchemy import Select
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.future import select
from sqlalchemy.orm import selectinload
from sqlalchemy.orm import Session
from onyx.auth.schemas import UserRole
@@ -98,11 +97,6 @@ async def get_user_count(only_admin_users: bool = False) -> int:
# Need to override this because FastAPI Users doesn't give flexibility for backend field creation logic in OAuth flow
class SQLAlchemyUserAdminDB(SQLAlchemyUserDatabase[UP, ID]):
async def _get_user(self, statement: Select) -> UP | None:
statement = statement.options(selectinload(User.memories))
results = await self.session.execute(statement)
return results.unique().scalar_one_or_none()
async def create(
self,
create_dict: Dict[str, Any],

View File

@@ -8,7 +8,6 @@ from uuid import UUID
from fastapi import HTTPException
from sqlalchemy import delete
from sqlalchemy import desc
from sqlalchemy import exists
from sqlalchemy import func
from sqlalchemy import nullsfirst
from sqlalchemy import or_
@@ -132,32 +131,47 @@ def get_chat_sessions_by_user(
if before is not None:
stmt = stmt.where(ChatSession.time_updated < before)
if limit:
stmt = stmt.limit(limit)
if project_id is not None:
stmt = stmt.where(ChatSession.project_id == project_id)
elif only_non_project_chats:
stmt = stmt.where(ChatSession.project_id.is_(None))
if not include_failed_chats:
non_system_message_exists_subq = (
exists()
.where(ChatMessage.chat_session_id == ChatSession.id)
.where(ChatMessage.message_type != MessageType.SYSTEM)
.correlate(ChatSession)
)
# Leeway for newly created chats that don't have messages yet
time = datetime.now(timezone.utc) - timedelta(minutes=5)
recently_created = ChatSession.time_created >= time
stmt = stmt.where(or_(non_system_message_exists_subq, recently_created))
# When filtering out failed chats, we apply the limit in Python after
# filtering rather than in SQL, since the post-filter may remove rows.
if limit and include_failed_chats:
stmt = stmt.limit(limit)
result = db_session.execute(stmt)
chat_sessions = result.scalars().all()
chat_sessions = list(result.scalars().all())
return list(chat_sessions)
if not include_failed_chats and chat_sessions:
# Filter out "failed" sessions (those with only SYSTEM messages)
# using a separate efficient query instead of a correlated EXISTS
# subquery, which causes full sequential scans of chat_message.
leeway = datetime.now(timezone.utc) - timedelta(minutes=5)
session_ids = [cs.id for cs in chat_sessions if cs.time_created < leeway]
if session_ids:
valid_session_ids_stmt = (
select(ChatMessage.chat_session_id)
.where(ChatMessage.chat_session_id.in_(session_ids))
.where(ChatMessage.message_type != MessageType.SYSTEM)
.distinct()
)
valid_session_ids = set(
db_session.execute(valid_session_ids_stmt).scalars().all()
)
chat_sessions = [
cs
for cs in chat_sessions
if cs.time_created >= leeway or cs.id in valid_session_ids
]
if limit:
chat_sessions = chat_sessions[:limit]
return chat_sessions
def delete_orphaned_search_docs(db_session: Session) -> None:
@@ -694,8 +708,7 @@ def set_preferred_response(
)
if assistant_msg.parent_message_id != user_message_id:
raise ValueError(
f"Assistant message {preferred_assistant_message_id} is not a child "
f"of user message {user_message_id}"
f"Assistant message {preferred_assistant_message_id} is not a child of user message {user_message_id}"
)
user_msg.preferred_response_id = preferred_assistant_message_id

View File

@@ -215,6 +215,7 @@ class UserFileStatus(str, PyEnum):
PROCESSING = "PROCESSING"
INDEXING = "INDEXING"
COMPLETED = "COMPLETED"
SKIPPED = "SKIPPED"
FAILED = "FAILED"
CANCELED = "CANCELED"
DELETING = "DELETING"

View File

@@ -8,7 +8,6 @@ from uuid import UUID
from sqlalchemy import select
from sqlalchemy import update
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import selectinload
from sqlalchemy.orm import Session
from onyx.auth.pat import build_displayable_pat
@@ -47,7 +46,6 @@ async def fetch_user_for_pat(
(PersonalAccessToken.expires_at.is_(None))
| (PersonalAccessToken.expires_at > now)
)
.options(selectinload(User.memories))
)
if not user:
return None

View File

@@ -7,6 +7,7 @@ from fastapi import HTTPException
from fastapi import UploadFile
from pydantic import BaseModel
from pydantic import ConfigDict
from pydantic import Field
from sqlalchemy import func
from sqlalchemy.orm import Session
from starlette.background import BackgroundTasks
@@ -17,6 +18,7 @@ from onyx.configs.constants import FileOrigin
from onyx.configs.constants import OnyxCeleryPriority
from onyx.configs.constants import OnyxCeleryQueues
from onyx.configs.constants import OnyxCeleryTask
from onyx.db.enums import UserFileStatus
from onyx.db.models import Project__UserFile
from onyx.db.models import User
from onyx.db.models import UserFile
@@ -34,9 +36,19 @@ class CategorizedFilesResult(BaseModel):
user_files: list[UserFile]
rejected_files: list[RejectedFile]
id_to_temp_id: dict[str, str]
# Filenames that should be stored but not indexed.
skip_indexing_filenames: set[str] = Field(default_factory=set)
# Allow SQLAlchemy ORM models inside this result container
model_config = ConfigDict(arbitrary_types_allowed=True)
@property
def indexable_files(self) -> list[UserFile]:
return [
uf
for uf in self.user_files
if (uf.name or "") not in self.skip_indexing_filenames
]
def build_hashed_file_key(file: UploadFile) -> str:
name_prefix = (file.filename or "")[:50]
@@ -70,6 +82,7 @@ def create_user_files(
)
if new_temp_id is not None:
id_to_temp_id[str(new_id)] = new_temp_id
should_skip = (file.filename or "") in categorized_files.skip_indexing
new_file = UserFile(
id=new_id,
user_id=user.id,
@@ -81,6 +94,7 @@ def create_user_files(
link_url=link_url,
content_type=file.content_type,
file_type=file.content_type,
status=UserFileStatus.SKIPPED if should_skip else UserFileStatus.PROCESSING,
last_accessed_at=datetime.datetime.now(datetime.timezone.utc),
)
# Persist the UserFile first to satisfy FK constraints for association table
@@ -98,6 +112,7 @@ def create_user_files(
user_files=user_files,
rejected_files=rejected_files,
id_to_temp_id=id_to_temp_id,
skip_indexing_filenames=categorized_files.skip_indexing,
)
@@ -123,6 +138,7 @@ def upload_files_to_user_files_with_indexing(
user_files = categorized_files_result.user_files
rejected_files = categorized_files_result.rejected_files
id_to_temp_id = categorized_files_result.id_to_temp_id
indexable_files = categorized_files_result.indexable_files
# Trigger per-file processing immediately for the current tenant
tenant_id = get_current_tenant_id()
for rejected_file in rejected_files:
@@ -134,12 +150,12 @@ def upload_files_to_user_files_with_indexing(
from onyx.background.task_utils import drain_processing_loop
background_tasks.add_task(drain_processing_loop, tenant_id)
for user_file in user_files:
for user_file in indexable_files:
logger.info(f"Queued in-process processing for user_file_id={user_file.id}")
else:
from onyx.background.celery.versioned_apps.client import app as client_app
for user_file in user_files:
for user_file in indexable_files:
task = client_app.send_task(
OnyxCeleryTask.PROCESS_SINGLE_USER_FILE,
kwargs={"user_file_id": user_file.id, "tenant_id": tenant_id},
@@ -155,6 +171,7 @@ def upload_files_to_user_files_with_indexing(
user_files=user_files,
rejected_files=rejected_files,
id_to_temp_id=id_to_temp_id,
skip_indexing_filenames=categorized_files_result.skip_indexing_filenames,
)

View File

@@ -229,7 +229,9 @@ def get_memories_for_user(
user_id: UUID,
db_session: Session,
) -> Sequence[Memory]:
return db_session.scalars(select(Memory).where(Memory.user_id == user_id)).all()
return db_session.scalars(
select(Memory).where(Memory.user_id == user_id).order_by(Memory.id.desc())
).all()
def update_user_pinned_assistants(

View File

@@ -932,7 +932,7 @@ class OpenSearchIndexClient(OpenSearchClient):
def search_for_document_ids(
self,
body: dict[str, Any],
search_type: OpenSearchSearchType = OpenSearchSearchType.DOCUMENT_IDS,
search_type: OpenSearchSearchType = OpenSearchSearchType.UNKNOWN,
) -> list[str]:
"""Searches the index and returns only document chunk IDs.

View File

@@ -37,10 +37,10 @@ M = 32 # Set relatively high for better accuracy.
# we have a much higher chance of all 10 of the final desired docs showing up
# and getting scored. In worse situations, the final 10 docs don't even show up
# as the final 10 (worse than just a miss at the reranking step).
# Defaults to 100 for now. Initially this defaulted to 750 but we were seeing
# poor search performance.
# Defaults to 500 for now. This originally defaulted to 750, which gave poor
# search performance; it was dropped to 100, then bumped back to 500 to improve recall.
DEFAULT_NUM_HYBRID_SUBQUERY_CANDIDATES = int(
os.environ.get("DEFAULT_NUM_HYBRID_SUBQUERY_CANDIDATES", 100)
os.environ.get("DEFAULT_NUM_HYBRID_SUBQUERY_CANDIDATES", 500)
)
# Number of vectors to examine to decide the top k neighbors for the HNSW
@@ -60,8 +60,7 @@ class OpenSearchSearchType(str, Enum):
KEYWORD = "keyword"
SEMANTIC = "semantic"
RANDOM = "random"
ID_RETRIEVAL = "id_retrieval"
DOCUMENT_IDS = "document_ids"
DOC_ID_RETRIEVAL = "doc_id_retrieval"
UNKNOWN = "unknown"

View File

@@ -928,7 +928,7 @@ class OpenSearchDocumentIndex(DocumentIndex):
search_hits = self._client.search(
body=query_body,
search_pipeline_id=None,
search_type=OpenSearchSearchType.ID_RETRIEVAL,
search_type=OpenSearchSearchType.DOC_ID_RETRIEVAL,
)
inference_chunks_uncleaned: list[InferenceChunkUncleaned] = [
_convert_retrieved_opensearch_chunk_to_inference_chunk_uncleaned(

View File

@@ -20,6 +20,7 @@ from onyx.background.celery.tasks.opensearch_migration.transformer import (
from onyx.configs.app_configs import LOG_VESPA_TIMING_INFORMATION
from onyx.configs.app_configs import VESPA_LANGUAGE_OVERRIDE
from onyx.configs.app_configs import VESPA_MIGRATION_REQUEST_TIMEOUT_S
from onyx.configs.app_configs import VESPA_MIGRATION_SERVER_SIDE_REQUEST_TIMEOUT
from onyx.context.search.models import IndexFilters
from onyx.context.search.models import InferenceChunkUncleaned
from onyx.document_index.interfaces import VespaChunkRequest
@@ -335,6 +336,11 @@ def get_all_chunks_paginated(
"format.tensors": "short-value",
"slices": total_slices,
"sliceId": slice_id,
# When exceeded, Vespa should return gracefully with partial
# results. Even if no hits are returned, Vespa should still return a
# new continuation token representing a new spot in the linear
# traversal.
"timeout": VESPA_MIGRATION_SERVER_SIDE_REQUEST_TIMEOUT,
}
if continuation_token is not None:
params["continuation"] = continuation_token
@@ -343,6 +349,9 @@ def get_all_chunks_paginated(
start_time = time.monotonic()
try:
with get_vespa_http_client(
# When exceeded, an exception is raised in our code. No progress
# is saved, and the task will retry this spot in the traversal
# later.
timeout=VESPA_MIGRATION_REQUEST_TIMEOUT_S
) as http_client:
response = http_client.get(url, params=params)
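A minimal sketch of the two-level timeout with httpx directly (endpoint URL is illustrative; the values are the defaults above):

import httpx

# Client-side deadline: httpx raises in our code when exceeded; no progress is saved.
with httpx.Client(timeout=120) as client:
    resp = client.get(
        "https://vespa.internal:8081/search/",  # illustrative endpoint
        params={
            # Server-side deadline: Vespa wraps up and returns partial results.
            "timeout": "110s",
        },
    )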

View File

@@ -1,3 +1,4 @@
import csv
import gc
import io
import json
@@ -19,6 +20,7 @@ from zipfile import BadZipFile
import chardet
import openpyxl
from openpyxl.worksheet.worksheet import Worksheet
from PIL import Image
from onyx.configs.constants import ONYX_METADATA_FILENAME
@@ -353,6 +355,94 @@ def pptx_to_text(file: IO[Any], file_name: str = "") -> str:
return presentation.markdown
def _worksheet_to_matrix(
worksheet: Worksheet,
) -> list[list[str]]:
"""
Converts a single worksheet to a matrix of string cell values.
"""
rows: list[list[str]] = []
for worksheet_row in worksheet.iter_rows(min_row=1, values_only=True):
row = ["" if cell is None else str(cell) for cell in worksheet_row]
rows.append(row)
return rows
def _clean_worksheet_matrix(matrix: list[list[str]]) -> list[list[str]]:
"""
Cleans a worksheet matrix by capping runs of consecutive empty rows at
MAX_EMPTY_ROWS and runs of consecutive empty columns at MAX_EMPTY_COLS.
"""
MAX_EMPTY_ROWS = 2 # Runs longer than this are capped to max_empty; shorter runs are preserved as-is
MAX_EMPTY_COLS = 2
# Row cleanup
matrix = _remove_empty_runs(matrix, max_empty=MAX_EMPTY_ROWS)
if not matrix:
return matrix
# Column cleanup — determine which columns to keep without transposing.
num_cols = len(matrix[0])
keep_cols = _columns_to_keep(matrix, num_cols, max_empty=MAX_EMPTY_COLS)
if len(keep_cols) < num_cols:
matrix = [[row[c] for c in keep_cols] for row in matrix]
return matrix
def _columns_to_keep(
matrix: list[list[str]], num_cols: int, max_empty: int
) -> list[int]:
"""Return the indices of columns to keep after removing empty-column runs.
Uses the same logic as ``_remove_empty_runs`` but operates on column
indices so no transpose is needed.
"""
kept: list[int] = []
empty_buffer: list[int] = []
for col_idx in range(num_cols):
col_is_empty = all(not row[col_idx] for row in matrix)
if col_is_empty:
empty_buffer.append(col_idx)
else:
kept.extend(empty_buffer[:max_empty])
kept.append(col_idx)
empty_buffer = []
return kept
def _remove_empty_runs(
rows: list[list[str]],
max_empty: int,
) -> list[list[str]]:
"""Removes entire runs of empty rows when the run length exceeds max_empty.
Leading empty runs are capped to max_empty, just like interior runs.
Trailing empty rows are always dropped since there is no subsequent
non-empty row to flush them.
"""
result: list[list[str]] = []
empty_buffer: list[list[str]] = []
for row in rows:
# Check if empty
if not any(row):
if len(empty_buffer) < max_empty:
empty_buffer.append(row)
else:
# Flush up to max_empty buffered empty rows onto the result
result.extend(empty_buffer[:max_empty])
# Add the new non-empty row
result.append(row)
empty_buffer = []
return result
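To make the capping behavior concrete, a small worked example with max_empty=2:

# The interior run of 4 empty rows is capped to 2; the trailing empty row
# is dropped because no non-empty row follows to flush the buffer.
rows = [["a"], [""], [""], [""], [""], ["b"], [""]]
assert _remove_empty_runs(rows, max_empty=2) == [["a"], [""], [""], ["b"]]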
def xlsx_to_text(file: IO[Any], file_name: str = "") -> str:
# TODO: switch back to this approach in a few months when markitdown
# fixes their handling of excel files
@@ -391,30 +481,15 @@ def xlsx_to_text(file: IO[Any], file_name: str = "") -> str:
f"Failed to extract text from {file_name or 'xlsx file'}. This happens due to a bug in openpyxl. {e}"
)
return ""
raise e
raise
text_content = []
for sheet in workbook.worksheets:
rows = []
num_empty_consecutive_rows = 0
for row in sheet.iter_rows(min_row=1, values_only=True):
row_str = ",".join(str(cell or "") for cell in row)
# Only add the row if there are any values in the cells
if len(row_str) >= len(row):
rows.append(row_str)
num_empty_consecutive_rows = 0
else:
num_empty_consecutive_rows += 1
if num_empty_consecutive_rows > 100:
# handle massive excel sheets with mostly empty cells
logger.warning(
f"Found {num_empty_consecutive_rows} empty rows in {file_name}, skipping rest of file"
)
break
sheet_str = "\n".join(rows)
text_content.append(sheet_str)
sheet_matrix = _clean_worksheet_matrix(_worksheet_to_matrix(sheet))
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerows(sheet_matrix)
text_content.append(buf.getvalue().rstrip("\n"))
return TEXT_SECTION_SEPARATOR.join(text_content)
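The csv.writer round-trip in the new code replaces the old hand-rolled ",".join, which produced ambiguous output for cells containing commas or quotes. A minimal illustration with placeholder data:

import csv
import io

matrix = [["name", "note"], ["alice", "hello, world"]]
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(matrix)
print(buf.getvalue())
# name,note
# alice,"hello, world"   <- the comma-containing cell is quoted correctly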

View File

@@ -15,6 +15,7 @@ PLAIN_TEXT_MIME_TYPE = "text/plain"
class OnyxMimeTypes:
IMAGE_MIME_TYPES = {"image/jpg", "image/jpeg", "image/png", "image/webp"}
CSV_MIME_TYPES = {"text/csv"}
TABULAR_MIME_TYPES = CSV_MIME_TYPES | {SPREADSHEET_MIME_TYPE}
TEXT_MIME_TYPES = {
PLAIN_TEXT_MIME_TYPE,
"text/markdown",
@@ -34,13 +35,12 @@ class OnyxMimeTypes:
PDF_MIME_TYPE,
WORD_PROCESSING_MIME_TYPE,
PRESENTATION_MIME_TYPE,
SPREADSHEET_MIME_TYPE,
"message/rfc822",
"application/epub+zip",
}
ALLOWED_MIME_TYPES = IMAGE_MIME_TYPES.union(
TEXT_MIME_TYPES, DOCUMENT_MIME_TYPES, CSV_MIME_TYPES
TEXT_MIME_TYPES, DOCUMENT_MIME_TYPES, TABULAR_MIME_TYPES
)
EXCLUDED_IMAGE_TYPES = {
@@ -53,6 +53,11 @@ class OnyxMimeTypes:
class OnyxFileExtensions:
TABULAR_EXTENSIONS = {
".csv",
".tsv",
".xlsx",
}
PLAIN_TEXT_EXTENSIONS = {
".txt",
".md",

View File

@@ -13,15 +13,21 @@ class ChatFileType(str, Enum):
DOC = "document"
# Plain text files contain only the text
PLAIN_TEXT = "plain_text"
CSV = "csv"
# Tabular data files (CSV, XLSX)
TABULAR = "tabular"
def is_text_file(self) -> bool:
return self in (
ChatFileType.PLAIN_TEXT,
ChatFileType.DOC,
ChatFileType.CSV,
ChatFileType.TABULAR,
)
def use_metadata_only(self) -> bool:
"""File types where we can ignore the file content
and only use the metadata."""
return self in (ChatFileType.TABULAR,)
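As a quick illustration of the intended contract (assertions only, no side effects):

# TABULAR decodes as text, but by default only its metadata is injected
# into the LLM context; the content is read on demand (e.g. by tools).
assert ChatFileType.TABULAR.is_text_file()
assert ChatFileType.TABULAR.use_metadata_only()
assert not ChatFileType.PLAIN_TEXT.use_metadata_only()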
class FileDescriptor(TypedDict):
"""NOTE: is a `TypedDict` so it can be used as a type hint for a JSONB column

View File

@@ -110,16 +110,20 @@ def load_user_file(file_id: UUID, db_session: Session) -> InMemoryChatFile:
# check for plain text normalized version first, then use original file otherwise
try:
file_io = file_store.read_file(plaintext_file_name, mode="b")
# For plaintext versions, use PLAIN_TEXT type (unless it's an image which doesn't have plaintext)
plaintext_chat_file_type = (
ChatFileType.PLAIN_TEXT
if chat_file_type != ChatFileType.IMAGE
else chat_file_type
)
# if we have plaintext for image (which happens when image extraction is enabled), we use PLAIN_TEXT type
if file_io is not None:
# Metadata-only file types preserve their original type so
# downstream injection paths can route them correctly.
if chat_file_type.use_metadata_only():
plaintext_chat_file_type = chat_file_type
elif file_io is not None:
# if we have plaintext for image (which happens when image
# extraction is enabled), we use PLAIN_TEXT type
plaintext_chat_file_type = ChatFileType.PLAIN_TEXT
else:
plaintext_chat_file_type = (
ChatFileType.PLAIN_TEXT
if chat_file_type != ChatFileType.IMAGE
else chat_file_type
)
chat_file = InMemoryChatFile(
file_id=str(user_file.file_id),

View File

@@ -1,4 +1,3 @@
from onyx.configs.app_configs import HOOK_ENABLED
from onyx.error_handling.error_codes import OnyxErrorCode
from onyx.error_handling.exceptions import OnyxError
from shared_configs.configs import MULTI_TENANT
@@ -7,10 +6,7 @@ from shared_configs.configs import MULTI_TENANT
def require_hook_enabled() -> None:
"""FastAPI dependency that gates all hook management endpoints.
Hooks are only available in single-tenant / self-hosted deployments with
HOOK_ENABLED=true explicitly set. Two layers of protection:
1. MULTI_TENANT check — rejects even if HOOK_ENABLED is accidentally set true
2. HOOK_ENABLED flag — explicit opt-in by the operator
Hooks are only available in single-tenant / self-hosted EE deployments.
Use as: Depends(require_hook_enabled)
"""
@@ -19,8 +15,3 @@ def require_hook_enabled() -> None:
OnyxErrorCode.SINGLE_TENANT_ONLY,
"Hooks are not available in multi-tenant deployments",
)
if not HOOK_ENABLED:
raise OnyxError(
OnyxErrorCode.ENV_VAR_GATED,
"Hooks are not enabled. Set HOOK_ENABLED=true to enable.",
)

View File

@@ -1,79 +1,22 @@
"""Hook executor — calls a customer's external HTTP endpoint for a given hook point.
"""CE hook executor.
Usage (Celery tasks and FastAPI handlers):
result = execute_hook(
db_session=db_session,
hook_point=HookPoint.QUERY_PROCESSING,
payload={"query": "...", "user_email": "...", "chat_session_id": "..."},
response_type=QueryProcessingResponse,
)
HookSkipped and HookSoftFailed are real classes kept here because
process_message.py (CE code) uses isinstance checks against them.
if isinstance(result, HookSkipped):
# no active hook configured — continue with original behavior
...
elif isinstance(result, HookSoftFailed):
# hook failed but fail strategy is SOFT — continue with original behavior
...
else:
# result is a validated Pydantic model instance (response_type)
...
is_reachable update policy
--------------------------
``is_reachable`` on the Hook row is updated selectively — only when the outcome
carries meaningful signal about physical reachability:
NetworkError (DNS, connection refused) → False (cannot reach the server)
HTTP 401 / 403 → False (api_key revoked or invalid)
TimeoutException → None (server may be slow, skip write)
Other HTTP errors (4xx / 5xx) → None (server responded, skip write)
Unknown exception → None (no signal, skip write)
Non-JSON / non-dict response → None (server responded, skip write)
Success (2xx, valid dict) → True (confirmed reachable)
None means "leave the current value unchanged" — no DB round-trip is made.
DB session design
-----------------
The executor uses three sessions:
1. Caller's session (db_session) — used only for the hook lookup read. All
needed fields are extracted from the Hook object before the HTTP call, so
the caller's session is not held open during the external HTTP request.
2. Log session — a separate short-lived session opened after the HTTP call
completes to write the HookExecutionLog row on failure. Success runs are
not recorded. Committed independently of everything else.
3. Reachable session — a second short-lived session to update is_reachable on
the Hook. Kept separate from the log session so a concurrent hook deletion
(which causes update_hook__no_commit to raise OnyxError(NOT_FOUND)) cannot
prevent the execution log from being written. This update is best-effort.
execute_hook is the public entry point. It dispatches to _execute_hook_impl
via fetch_versioned_implementation so that:
- CE: onyx.hooks.executor._execute_hook_impl → no-op, returns HookSkipped()
- EE: ee.onyx.hooks.executor._execute_hook_impl → real HTTP call
"""
import json
import time
from typing import Any
from typing import TypeVar
import httpx
from pydantic import BaseModel
from pydantic import ValidationError
from sqlalchemy.orm import Session
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.db.enums import HookFailStrategy
from onyx.db.enums import HookPoint
from onyx.db.hook import create_hook_execution_log__no_commit
from onyx.db.hook import get_non_deleted_hook_by_hook_point
from onyx.db.hook import update_hook__no_commit
from onyx.db.models import Hook
from onyx.error_handling.error_codes import OnyxErrorCode
from onyx.error_handling.exceptions import OnyxError
from onyx.hooks.utils import HOOKS_AVAILABLE
from onyx.utils.logger import setup_logger
logger = setup_logger()
from onyx.utils.variable_functionality import fetch_versioned_implementation
class HookSkipped:
@@ -87,277 +30,15 @@ class HookSoftFailed:
T = TypeVar("T", bound=BaseModel)
# ---------------------------------------------------------------------------
# Private helpers
# ---------------------------------------------------------------------------
class _HttpOutcome(BaseModel):
"""Structured result of an HTTP hook call, returned by _process_response."""
is_success: bool
updated_is_reachable: (
bool | None
) # True/False = write to DB, None = unchanged (skip write)
status_code: int | None
error_message: str | None
response_payload: dict[str, Any] | None
def _lookup_hook(
db_session: Session,
hook_point: HookPoint,
) -> Hook | HookSkipped:
"""Return the active Hook or HookSkipped if hooks are unavailable/unconfigured.
No HTTP call is made and no DB writes are performed for any HookSkipped path.
There is nothing to log and no reachability information to update.
"""
if not HOOKS_AVAILABLE:
return HookSkipped()
hook = get_non_deleted_hook_by_hook_point(
db_session=db_session, hook_point=hook_point
)
if hook is None or not hook.is_active:
return HookSkipped()
if not hook.endpoint_url:
return HookSkipped()
return hook
def _process_response(
def _execute_hook_impl(
*,
response: httpx.Response | None,
exc: Exception | None,
timeout: float,
) -> _HttpOutcome:
"""Process the result of an HTTP call and return a structured outcome.
Called after the client.post() try/except. If post() raised, exc is set and
response is None. Otherwise response is set and exc is None. Handles
raise_for_status(), JSON decoding, and the dict shape check.
"""
if exc is not None:
if isinstance(exc, httpx.NetworkError):
msg = f"Hook network error (endpoint unreachable): {exc}"
logger.warning(msg, exc_info=exc)
return _HttpOutcome(
is_success=False,
updated_is_reachable=False,
status_code=None,
error_message=msg,
response_payload=None,
)
if isinstance(exc, httpx.TimeoutException):
msg = f"Hook timed out after {timeout}s: {exc}"
logger.warning(msg, exc_info=exc)
return _HttpOutcome(
is_success=False,
updated_is_reachable=None, # timeout doesn't indicate unreachability
status_code=None,
error_message=msg,
response_payload=None,
)
msg = f"Hook call failed: {exc}"
logger.exception(msg, exc_info=exc)
return _HttpOutcome(
is_success=False,
updated_is_reachable=None, # unknown error — don't make assumptions
status_code=None,
error_message=msg,
response_payload=None,
)
if response is None:
raise ValueError(
"exactly one of response or exc must be non-None; both are None"
)
status_code = response.status_code
try:
response.raise_for_status()
except httpx.HTTPStatusError as e:
msg = f"Hook returned HTTP {e.response.status_code}: {e.response.text}"
logger.warning(msg, exc_info=e)
# 401/403 means the api_key has been revoked or is invalid — mark unreachable
# so the operator knows to update it. All other HTTP errors keep is_reachable
# as-is (server is up, the request just failed for application reasons).
auth_failed = e.response.status_code in (401, 403)
return _HttpOutcome(
is_success=False,
updated_is_reachable=False if auth_failed else None,
status_code=status_code,
error_message=msg,
response_payload=None,
)
try:
response_payload = response.json()
except (json.JSONDecodeError, httpx.DecodingError) as e:
msg = f"Hook returned non-JSON response: {e}"
logger.warning(msg, exc_info=e)
return _HttpOutcome(
is_success=False,
updated_is_reachable=None, # server responded — reachability unchanged
status_code=status_code,
error_message=msg,
response_payload=None,
)
if not isinstance(response_payload, dict):
msg = f"Hook returned non-dict JSON (got {type(response_payload).__name__})"
logger.warning(msg)
return _HttpOutcome(
is_success=False,
updated_is_reachable=None, # server responded — reachability unchanged
status_code=status_code,
error_message=msg,
response_payload=None,
)
return _HttpOutcome(
is_success=True,
updated_is_reachable=True,
status_code=status_code,
error_message=None,
response_payload=response_payload,
)
def _persist_result(
*,
hook_id: int,
outcome: _HttpOutcome,
duration_ms: int,
) -> None:
"""Write the execution log on failure and optionally update is_reachable, each
in its own session so a failure in one does not affect the other."""
# Only write the execution log on failure — success runs are not recorded.
# Must not be skipped if the is_reachable update fails (e.g. hook concurrently
# deleted between the initial lookup and here).
if not outcome.is_success:
try:
with get_session_with_current_tenant() as log_session:
create_hook_execution_log__no_commit(
db_session=log_session,
hook_id=hook_id,
is_success=False,
error_message=outcome.error_message,
status_code=outcome.status_code,
duration_ms=duration_ms,
)
log_session.commit()
except Exception:
logger.exception(
f"Failed to persist hook execution log for hook_id={hook_id}"
)
# Update is_reachable separately — best-effort, non-critical.
# None means the value is unchanged (set by the caller to skip the no-op write).
# update_hook__no_commit can raise OnyxError(NOT_FOUND) if the hook was
# concurrently deleted, so keep this isolated from the log write above.
if outcome.updated_is_reachable is not None:
try:
with get_session_with_current_tenant() as reachable_session:
update_hook__no_commit(
db_session=reachable_session,
hook_id=hook_id,
is_reachable=outcome.updated_is_reachable,
)
reachable_session.commit()
except Exception:
logger.warning(f"Failed to update is_reachable for hook_id={hook_id}")
# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------
def _execute_hook_inner(
hook: Hook,
payload: dict[str, Any],
response_type: type[T],
) -> T | HookSoftFailed:
"""Make the HTTP call, validate the response, and return a typed model.
Raises OnyxError on HARD failure. Returns HookSoftFailed on SOFT failure.
"""
timeout = hook.timeout_seconds
hook_id = hook.id
fail_strategy = hook.fail_strategy
endpoint_url = hook.endpoint_url
current_is_reachable: bool | None = hook.is_reachable
if not endpoint_url:
raise ValueError(
f"hook_id={hook_id} is active but has no endpoint_url — "
"active hooks without an endpoint_url must be rejected by _lookup_hook"
)
start = time.monotonic()
response: httpx.Response | None = None
exc: Exception | None = None
try:
api_key: str | None = (
hook.api_key.get_value(apply_mask=False) if hook.api_key else None
)
headers: dict[str, str] = {"Content-Type": "application/json"}
if api_key:
headers["Authorization"] = f"Bearer {api_key}"
with httpx.Client(
timeout=timeout, follow_redirects=False
) as client: # SSRF guard: never follow redirects
response = client.post(endpoint_url, json=payload, headers=headers)
except Exception as e:
exc = e
duration_ms = int((time.monotonic() - start) * 1000)
outcome = _process_response(response=response, exc=exc, timeout=timeout)
# Validate the response payload against response_type.
# A validation failure downgrades the outcome to a failure so it is logged,
# is_reachable is left unchanged (server responded — just a bad payload),
# and fail_strategy is respected below.
validated_model: T | None = None
if outcome.is_success and outcome.response_payload is not None:
try:
validated_model = response_type.model_validate(outcome.response_payload)
except ValidationError as e:
msg = (
f"Hook response failed validation against {response_type.__name__}: {e}"
)
outcome = _HttpOutcome(
is_success=False,
updated_is_reachable=None, # server responded — reachability unchanged
status_code=outcome.status_code,
error_message=msg,
response_payload=None,
)
# Skip the is_reachable write when the value would not change — avoids a
# no-op DB round-trip on every call when the hook is already in the expected state.
if outcome.updated_is_reachable == current_is_reachable:
outcome = outcome.model_copy(update={"updated_is_reachable": None})
_persist_result(hook_id=hook_id, outcome=outcome, duration_ms=duration_ms)
if not outcome.is_success:
if fail_strategy == HookFailStrategy.HARD:
raise OnyxError(
OnyxErrorCode.HOOK_EXECUTION_FAILED,
outcome.error_message or "Hook execution failed.",
)
logger.warning(
f"Hook execution failed (soft fail) for hook_id={hook_id}: {outcome.error_message}"
)
return HookSoftFailed()
if validated_model is None:
raise OnyxError(
OnyxErrorCode.INTERNAL_ERROR,
f"validated_model is None for successful hook call (hook_id={hook_id})",
)
return validated_model
db_session: Session, # noqa: ARG001
hook_point: HookPoint, # noqa: ARG001
payload: dict[str, Any], # noqa: ARG001
response_type: type[T], # noqa: ARG001
) -> T | HookSkipped | HookSoftFailed:
"""CE no-op — hooks are not available without EE."""
return HookSkipped()
def execute_hook(
@@ -367,25 +48,15 @@ def execute_hook(
payload: dict[str, Any],
response_type: type[T],
) -> T | HookSkipped | HookSoftFailed:
"""Execute the hook for the given hook point synchronously.
"""Execute the hook for the given hook point.
Returns HookSkipped if no active hook is configured, HookSoftFailed if the
hook failed with SOFT fail strategy, or a validated response model on success.
Raises OnyxError on HARD failure or if the hook is misconfigured.
Dispatches to the versioned implementation so EE gets the real executor
and CE gets the no-op stub, without any changes at the call site.
"""
hook = _lookup_hook(db_session, hook_point)
if isinstance(hook, HookSkipped):
return hook
fail_strategy = hook.fail_strategy
hook_id = hook.id
try:
return _execute_hook_inner(hook, payload, response_type)
except Exception:
if fail_strategy == HookFailStrategy.SOFT:
logger.exception(
f"Unexpected error in hook execution (soft fail) for hook_id={hook_id}"
)
return HookSoftFailed()
raise
impl = fetch_versioned_implementation("onyx.hooks.executor", "_execute_hook_impl")
return impl(
db_session=db_session,
hook_point=hook_point,
payload=payload,
response_type=response_type,
)

View File

@@ -1,33 +1,114 @@
from pydantic import BaseModel
from pydantic import Field
from onyx.db.enums import HookFailStrategy
from onyx.db.enums import HookPoint
from onyx.hooks.points.base import HookPointSpec
# TODO(@Bo-Onyx): define payload and response fields
class DocumentIngestionSection(BaseModel):
"""Represents a single section of a document — either text or image, not both.
Text section: set `text`, leave `image_file_id` null.
Image section: set `image_file_id`, leave `text` null.
"""
text: str | None = Field(
default=None,
description="Text content of this section. Set for text sections, null for image sections.",
)
link: str | None = Field(
default=None,
description="Optional URL associated with this section. Preserve the original link from the payload if you want it retained.",
)
image_file_id: str | None = Field(
default=None,
description=(
"Opaque identifier for an image stored in the file store. "
"The image content is not included — this field signals that the section is an image. "
"Hooks can use its presence to reorder or drop image sections, but cannot read or modify the image itself."
),
)
class DocumentIngestionOwner(BaseModel):
display_name: str | None = Field(
default=None,
description="Human-readable name of the owner.",
)
email: str | None = Field(
default=None,
description="Email address of the owner.",
)
class DocumentIngestionPayload(BaseModel):
pass
document_id: str = Field(
description="Unique identifier for the document. Read-only — changes are ignored."
)
title: str | None = Field(description="Title of the document.")
semantic_identifier: str = Field(
description="Human-readable identifier used for display (e.g. file name, page title)."
)
source: str = Field(
description=(
"Connector source type (e.g. confluence, slack, google_drive). "
"Read-only — changes are ignored. "
"Full list of values: https://github.com/onyx-dot-app/onyx/blob/main/backend/onyx/configs/constants.py#L195"
)
)
sections: list[DocumentIngestionSection] = Field(
description="Sections of the document. Includes both text sections (text set, image_file_id null) and image sections (image_file_id set, text null)."
)
metadata: dict[str, list[str]] = Field(
description="Key-value metadata attached to the document. Values are always a list of strings."
)
doc_updated_at: str | None = Field(
description="ISO 8601 UTC timestamp of the last update at the source, or null if unknown. Example: '2024-03-15T10:30:00+00:00'."
)
primary_owners: list[DocumentIngestionOwner] | None = Field(
description="Primary owners of the document, or null if not available."
)
secondary_owners: list[DocumentIngestionOwner] | None = Field(
description="Secondary owners of the document, or null if not available."
)
class DocumentIngestionResponse(BaseModel):
pass
# Intentionally permissive — customer endpoints may return extra fields.
sections: list[DocumentIngestionSection] | None = Field(
description="The sections to index, in the desired order. Reorder, drop, or modify sections freely. Null or empty list drops the document."
)
rejection_reason: str | None = Field(
default=None,
description="Logged when sections is null or empty. Falls back to a generic message if omitted.",
)
class DocumentIngestionSpec(HookPointSpec):
"""Hook point that runs during document ingestion.
"""Hook point that runs on every document before it enters the indexing pipeline.
# TODO(@Bo-Onyx): define call site, input/output schema, and timeout budget.
Call site: immediately after Onyx's internal validation and before the
indexing pipeline begins — no partial writes have occurred yet.
If a Document Ingestion hook is configured, it takes precedence —
Document Ingestion Light will not run. Configure only one per deployment.
Supported use cases:
- Document filtering: drop documents based on content or metadata
- Content rewriting: redact PII or normalize text before indexing
"""
hook_point = HookPoint.DOCUMENT_INGESTION
display_name = "Document Ingestion"
description = "Runs during document ingestion. Allows filtering or transforming documents before indexing."
description = (
"Runs on every document before it enters the indexing pipeline. "
"Allows filtering, rewriting, or dropping documents."
)
default_timeout_seconds = 30.0
fail_hard_description = "The document will not be indexed."
default_fail_strategy = HookFailStrategy.HARD
# TODO(Bo-Onyx): update later
docs_url = "https://docs.google.com/document/d/1pGhB8Wcnhhj8rS4baEJL6CX05yFhuIDNk1gbBRiWu94/edit?tab=t.ue263ual5vdi"
docs_url = "https://docs.onyx.app/admins/advanced_configs/hook_extensions#document-ingestion"
payload_model = DocumentIngestionPayload
response_model = DocumentIngestionResponse
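To make the wire format concrete, here is a hypothetical request/response pair for this hook point, written as Python dicts; the field names come from the payload and response models above, while all values are invented:

payload = {
    "document_id": "conf-123",  # read-only, changes are ignored
    "title": "Runbook",
    "semantic_identifier": "Runbook",
    "source": "confluence",  # read-only, changes are ignored
    "sections": [
        {"text": "step 1 ...", "link": None, "image_file_id": None},
    ],
    "metadata": {"team": ["infra"]},
    "doc_updated_at": "2024-03-15T10:30:00+00:00",
    "primary_owners": [{"display_name": "Alice", "email": "alice@example.com"}],
    "secondary_owners": None,
}
# An endpoint that accepts the document echoes back the sections to index:
response_body = {"sections": payload["sections"], "rejection_reason": None}
# Returning {"sections": None} or {"sections": []} drops the document,
# with rejection_reason logged if provided.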

View File

@@ -65,8 +65,9 @@ class QueryProcessingSpec(HookPointSpec):
"The query will be blocked and the user will see an error message."
)
default_fail_strategy = HookFailStrategy.HARD
# TODO(Bo-Onyx): update later
docs_url = "https://docs.google.com/document/d/1pGhB8Wcnhhj8rS4baEJL6CX05yFhuIDNk1gbBRiWu94/edit?tab=t.g2r1a1699u87"
docs_url = (
"https://docs.onyx.app/admins/advanced_configs/hook_extensions#query-processing"
)
payload_model = QueryProcessingPayload
response_model = QueryProcessingResponse

View File

@@ -1,5 +0,0 @@
from onyx.configs.app_configs import HOOK_ENABLED
from shared_configs.configs import MULTI_TENANT
# True only when hooks are available: single-tenant deployment with HOOK_ENABLED=true.
HOOKS_AVAILABLE: bool = HOOK_ENABLED and not MULTI_TENANT

View File

@@ -33,6 +33,7 @@ from onyx.connectors.models import TextSection
from onyx.db.document import get_documents_by_ids
from onyx.db.document import upsert_document_by_connector_credential_pair
from onyx.db.document import upsert_documents
from onyx.db.enums import HookPoint
from onyx.db.hierarchy import link_hierarchy_nodes_to_documents
from onyx.db.models import Document as DBDocument
from onyx.db.models import IndexModelStatus
@@ -47,6 +48,13 @@ from onyx.document_index.interfaces import DocumentMetadata
from onyx.document_index.interfaces import IndexBatchParams
from onyx.file_processing.image_summarization import summarize_image_with_error_handling
from onyx.file_store.file_store import get_default_file_store
from onyx.hooks.executor import execute_hook
from onyx.hooks.executor import HookSkipped
from onyx.hooks.executor import HookSoftFailed
from onyx.hooks.points.document_ingestion import DocumentIngestionOwner
from onyx.hooks.points.document_ingestion import DocumentIngestionPayload
from onyx.hooks.points.document_ingestion import DocumentIngestionResponse
from onyx.hooks.points.document_ingestion import DocumentIngestionSection
from onyx.indexing.chunk_batch_store import ChunkBatchStore
from onyx.indexing.chunker import Chunker
from onyx.indexing.embedder import embed_chunks_with_failure_handling
@@ -297,6 +305,7 @@ def index_doc_batch_with_handler(
document_batch: list[Document],
request_id: str | None,
tenant_id: str,
db_session: Session,
adapter: IndexingBatchAdapter,
ignore_time_skip: bool = False,
enable_contextual_rag: bool = False,
@@ -310,6 +319,7 @@ def index_doc_batch_with_handler(
document_batch=document_batch,
request_id=request_id,
tenant_id=tenant_id,
db_session=db_session,
adapter=adapter,
ignore_time_skip=ignore_time_skip,
enable_contextual_rag=enable_contextual_rag,
@@ -785,6 +795,132 @@ def _verify_indexing_completeness(
)
def _apply_document_ingestion_hook(
documents: list[Document],
db_session: Session,
) -> list[Document]:
"""Apply the Document Ingestion hook to each document in the batch.
- HookSkipped / HookSoftFailed → document passes through unchanged.
- Response with sections=None → document is dropped (logged).
- Response with sections → document sections are replaced with the hook's output.
"""
def _build_payload(doc: Document) -> DocumentIngestionPayload:
return DocumentIngestionPayload(
document_id=doc.id or "",
title=doc.title,
semantic_identifier=doc.semantic_identifier,
source=doc.source.value if doc.source is not None else "",
sections=[
DocumentIngestionSection(
text=s.text if isinstance(s, TextSection) else None,
link=s.link,
image_file_id=(
s.image_file_id if isinstance(s, ImageSection) else None
),
)
for s in doc.sections
],
metadata={
k: v if isinstance(v, list) else [v] for k, v in doc.metadata.items()
},
doc_updated_at=(
doc.doc_updated_at.isoformat() if doc.doc_updated_at else None
),
primary_owners=(
[
DocumentIngestionOwner(
display_name=o.get_semantic_name() or None,
email=o.email,
)
for o in doc.primary_owners
]
if doc.primary_owners
else None
),
secondary_owners=(
[
DocumentIngestionOwner(
display_name=o.get_semantic_name() or None,
email=o.email,
)
for o in doc.secondary_owners
]
if doc.secondary_owners
else None
),
)
def _apply_result(
doc: Document,
hook_result: DocumentIngestionResponse | HookSkipped | HookSoftFailed,
) -> Document | None:
"""Return the modified doc, original doc (skip/soft-fail), or None (drop)."""
if isinstance(hook_result, (HookSkipped, HookSoftFailed)):
return doc
if not hook_result.sections:
reason = hook_result.rejection_reason or "Document rejected by hook"
logger.info(
f"Document ingestion hook dropped document doc_id={doc.id!r}: {reason}"
)
return None
new_sections: list[TextSection | ImageSection] = []
for s in hook_result.sections:
if s.image_file_id is not None:
new_sections.append(
ImageSection(image_file_id=s.image_file_id, link=s.link)
)
elif s.text is not None:
new_sections.append(TextSection(text=s.text, link=s.link))
else:
logger.warning(
f"Document ingestion hook returned a section with neither text nor "
f"image_file_id for doc_id={doc.id!r} — skipping section."
)
if not new_sections:
logger.info(
f"Document ingestion hook produced no valid sections for doc_id={doc.id!r} — dropping document."
)
return None
return doc.model_copy(update={"sections": new_sections})
if not documents:
return documents
# Run the hook for the first document. If it returns HookSkipped the hook
# is not configured — skip the remaining N-1 DB lookups.
first_doc = documents[0]
first_payload = _build_payload(first_doc).model_dump()
first_hook_result = execute_hook(
db_session=db_session,
hook_point=HookPoint.DOCUMENT_INGESTION,
payload=first_payload,
response_type=DocumentIngestionResponse,
)
if isinstance(first_hook_result, HookSkipped):
return documents
result: list[Document] = []
first_applied = _apply_result(first_doc, first_hook_result)
if first_applied is not None:
result.append(first_applied)
for doc in documents[1:]:
payload = _build_payload(doc).model_dump()
hook_result = execute_hook(
db_session=db_session,
hook_point=HookPoint.DOCUMENT_INGESTION,
payload=payload,
response_type=DocumentIngestionResponse,
)
applied = _apply_result(doc, hook_result)
if applied is not None:
result.append(applied)
return result
@log_function_time(debug_only=True)
def index_doc_batch(
*,
@@ -794,6 +930,7 @@ def index_doc_batch(
document_indices: list[DocumentIndex],
request_id: str | None,
tenant_id: str,
db_session: Session,
adapter: IndexingBatchAdapter,
enable_contextual_rag: bool = False,
llm: LLM | None = None,
@@ -818,6 +955,7 @@ def index_doc_batch(
)
filtered_documents = filter_fnc(document_batch)
filtered_documents = _apply_document_ingestion_hook(filtered_documents, db_session)
context = adapter.prepare(filtered_documents, ignore_time_skip)
if not context:
return IndexingPipelineResult.empty(len(filtered_documents))
@@ -1005,6 +1143,7 @@ def run_indexing_pipeline(
document_batch=document_batch,
request_id=request_id,
tenant_id=tenant_id,
db_session=db_session,
adapter=adapter,
enable_contextual_rag=enable_contextual_rag,
llm=llm,

View File

@@ -175,6 +175,28 @@ def _strip_tool_content_from_messages(
return result
def _fix_tool_user_message_ordering(
messages: list[dict[str, Any]],
) -> list[dict[str, Any]]:
"""Insert a synthetic assistant message between tool and user messages.
Some models (e.g. Mistral on Azure) require strict message ordering where
a user message cannot immediately follow a tool message. This function
inserts a minimal assistant message to bridge the gap.
"""
if len(messages) < 2:
return messages
result: list[dict[str, Any]] = [messages[0]]
for msg in messages[1:]:
prev_role = result[-1].get("role")
curr_role = msg.get("role")
if prev_role == "tool" and curr_role == "user":
result.append({"role": "assistant", "content": "Noted. Continuing."})
result.append(msg)
return result
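A quick before/after sketch with placeholder message contents:

messages = [
    {"role": "assistant", "tool_calls": []},  # placeholder tool-call turn
    {"role": "tool", "content": "42"},
    {"role": "user", "content": "thanks"},
]
fixed = _fix_tool_user_message_ordering(messages)
# The tool -> user adjacency is bridged by a synthetic assistant turn:
# [..., {"role": "tool", ...},
#       {"role": "assistant", "content": "Noted. Continuing."},
#       {"role": "user", ...}]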
def _messages_contain_tool_content(messages: list[dict[str, Any]]) -> bool:
"""Check if any messages contain tool-related content blocks."""
for msg in messages:
@@ -576,6 +598,18 @@ class LitellmLLM(LLM):
):
messages = _strip_tool_content_from_messages(messages)
# Some models (e.g. Mistral) reject a user message
# immediately after a tool message. Insert a synthetic
# assistant bridge message to satisfy the ordering
# constraint. Check both the provider and the deployment/
# model name to catch Mistral hosted on Azure.
model_or_deployment = (
self._deployment_name or self._model_version or ""
).lower()
is_mistral_model = is_mistral or "mistral" in model_or_deployment
if is_mistral_model:
messages = _fix_tool_user_message_ordering(messages)
# Only pass tool_choice when tools are present — some providers (e.g. Fireworks)
# reject requests where tool_choice is explicitly null.
if tools and tool_choice is not None:

View File

@@ -77,7 +77,6 @@ from onyx.server.features.default_assistant.api import (
)
from onyx.server.features.document_set.api import router as document_set_router
from onyx.server.features.hierarchy.api import router as hierarchy_router
from onyx.server.features.hooks.api import router as hook_router
from onyx.server.features.input_prompt.api import (
admin_router as admin_input_prompt_router,
)
@@ -439,6 +438,7 @@ def get_application(lifespan_override: Lifespan | None = None) -> FastAPI:
dsn=SENTRY_DSN,
integrations=[StarletteIntegration(), FastApiIntegration()],
traces_sample_rate=0.1,
release=__version__,
)
logger.info("Sentry initialized")
else:
@@ -454,7 +454,6 @@ def get_application(lifespan_override: Lifespan | None = None) -> FastAPI:
register_onyx_exception_handlers(application)
include_router_with_global_prefix_prepended(application, hook_router)
include_router_with_global_prefix_prepended(application, password_router)
include_router_with_global_prefix_prepended(application, chat_router)
include_router_with_global_prefix_prepended(application, query_router)

View File

@@ -76,11 +76,18 @@ class CategorizedFiles(BaseModel):
acceptable: list[UploadFile] = Field(default_factory=list)
rejected: list[RejectedFile] = Field(default_factory=list)
acceptable_file_to_token_count: dict[str, int] = Field(default_factory=dict)
# Filenames within `acceptable` that should be stored but not indexed.
skip_indexing: set[str] = Field(default_factory=set)
# Allow FastAPI UploadFile instances
model_config = ConfigDict(arbitrary_types_allowed=True)
def _skip_token_threshold(extension: str) -> bool:
"""Return True if this file extension should bypass the token limit."""
return extension.lower() in OnyxFileExtensions.TABULAR_EXTENSIONS
def _apply_long_side_cap(width: int, height: int, cap: int) -> tuple[int, int]:
if max(width, height) <= cap:
return width, height
@@ -264,7 +271,17 @@ def categorize_uploaded_files(
token_count = count_tokens(
text_content, tokenizer, token_limit=token_threshold
)
if token_threshold is not None and token_count > token_threshold:
exceeds_threshold = (
token_threshold is not None and token_count > token_threshold
)
if exceeds_threshold and _skip_token_threshold(extension):
# Exempt extensions (e.g. spreadsheets) are accepted
# but flagged to skip indexing — only metadata is
# injected into the LLM context.
results.acceptable.append(upload)
results.acceptable_file_to_token_count[filename] = token_count
results.skip_indexing.add(filename)
elif exceeds_threshold:
results.rejected.append(
RejectedFile(
filename=filename,

View File

@@ -147,6 +147,7 @@ class UserInfo(BaseModel):
is_anonymous_user: bool | None = None,
tenant_info: TenantInfo | None = None,
assistant_specific_configs: UserSpecificAssistantPreferences | None = None,
memories: list[MemoryItem] | None = None,
) -> "UserInfo":
return cls(
id=str(user.id),
@@ -191,10 +192,7 @@ class UserInfo(BaseModel):
role=user.personal_role or "",
use_memories=user.use_memories,
enable_memory_tool=user.enable_memory_tool,
memories=[
MemoryItem(id=memory.id, content=memory.memory_text)
for memory in (user.memories or [])
],
memories=memories or [],
user_preferences=user.user_preferences or "",
),
)

View File

@@ -57,6 +57,7 @@ from onyx.db.user_preferences import activate_user
from onyx.db.user_preferences import deactivate_user
from onyx.db.user_preferences import get_all_user_assistant_specific_configs
from onyx.db.user_preferences import get_latest_access_token_for_user
from onyx.db.user_preferences import get_memories_for_user
from onyx.db.user_preferences import update_assistant_preferences
from onyx.db.user_preferences import update_user_assistant_visibility
from onyx.db.user_preferences import update_user_auto_scroll
@@ -823,6 +824,11 @@ def verify_user_logged_in(
[],
),
)
memories = [
MemoryItem(id=memory.id, content=memory.memory_text)
for memory in get_memories_for_user(user.id, db_session)
]
user_info = UserInfo.from_model(
user,
current_token_created_at=token_created_at,
@@ -833,6 +839,7 @@ def verify_user_logged_in(
new_tenant=new_tenant,
invitation=tenant_invitation,
),
memories=memories,
)
return user_info
@@ -930,7 +937,8 @@ def update_user_personalization_api(
else user.enable_memory_tool
)
existing_memories = [
MemoryItem(id=memory.id, content=memory.memory_text) for memory in user.memories
MemoryItem(id=memory.id, content=memory.memory_text)
for memory in get_memories_for_user(user.id, db_session)
]
new_memories = (
request.memories if request.memories is not None else existing_memories

View File

@@ -9,8 +9,8 @@ def mime_type_to_chat_file_type(mime_type: str | None) -> ChatFileType:
if mime_type in OnyxMimeTypes.IMAGE_MIME_TYPES:
return ChatFileType.IMAGE
if mime_type in OnyxMimeTypes.CSV_MIME_TYPES:
return ChatFileType.CSV
if mime_type in OnyxMimeTypes.TABULAR_MIME_TYPES:
return ChatFileType.TABULAR
if mime_type in OnyxMimeTypes.DOCUMENT_MIME_TYPES:
return ChatFileType.DOC

View File

@@ -21,7 +21,6 @@ from onyx.db.notification import get_notifications
from onyx.db.notification import update_notification_last_shown
from onyx.error_handling.error_codes import OnyxErrorCode
from onyx.error_handling.exceptions import OnyxError
from onyx.hooks.utils import HOOKS_AVAILABLE
from onyx.key_value_store.factory import get_kv_store
from onyx.key_value_store.interface import KvKeyNotFoundError
from onyx.server.features.build.utils import is_onyx_craft_enabled
@@ -38,6 +37,7 @@ from onyx.utils.logger import setup_logger
from onyx.utils.variable_functionality import (
fetch_versioned_implementation_with_fallback,
)
from shared_configs.configs import MULTI_TENANT
logger = setup_logger()
@@ -98,7 +98,7 @@ def fetch_settings(
needs_reindexing=needs_reindexing,
onyx_craft_enabled=onyx_craft_enabled_for_user,
vector_db_enabled=not DISABLE_VECTOR_DB,
hooks_enabled=HOOKS_AVAILABLE,
hooks_enabled=not MULTI_TENANT,
version=onyx_version,
max_allowed_upload_size_mb=MAX_ALLOWED_UPLOAD_SIZE_MB,
default_user_file_max_upload_size_mb=min(

View File

@@ -116,7 +116,7 @@ class UserSettings(Settings):
# False when DISABLE_VECTOR_DB is set — connectors, RAG search, and
# document sets are unavailable.
vector_db_enabled: bool = True
# True when hooks are available: single-tenant deployment with HOOK_ENABLED=true.
# True when hooks are available: single-tenant EE deployments only.
hooks_enabled: bool = False
# Application version, read from the ONYX_VERSION env var at startup.
version: str | None = None

View File

@@ -1,3 +1,4 @@
import io
import json
from typing import Any
from typing import cast
@@ -9,6 +10,7 @@ from typing_extensions import override
from onyx.chat.emitter import Emitter
from onyx.configs.app_configs import DISABLE_VECTOR_DB
from onyx.db.engine.sql_engine import get_session_with_current_tenant
from onyx.file_processing.extract_file_text import extract_file_text
from onyx.file_store.models import ChatFileType
from onyx.file_store.models import InMemoryChatFile
from onyx.file_store.utils import load_chat_file_by_id
@@ -169,10 +171,13 @@ class FileReaderTool(Tool[FileReaderToolOverrideKwargs]):
chat_file = self._load_file(file_id)
# Only PLAIN_TEXT and CSV are guaranteed to contain actual text bytes.
# Only PLAIN_TEXT and TABULAR are guaranteed to contain actual text bytes.
# DOC type in a loaded file means plaintext extraction failed and the
# content is the original binary (e.g. raw PDF/DOCX bytes).
if chat_file.file_type not in (ChatFileType.PLAIN_TEXT, ChatFileType.CSV):
if chat_file.file_type not in (
ChatFileType.PLAIN_TEXT,
ChatFileType.TABULAR,
):
raise ToolCallException(
message=f"File {file_id} is not a text file (type={chat_file.file_type})",
llm_facing_message=(
@@ -181,7 +186,19 @@ class FileReaderTool(Tool[FileReaderToolOverrideKwargs]):
)
try:
full_text = chat_file.content.decode("utf-8", errors="replace")
if chat_file.file_type == ChatFileType.PLAIN_TEXT:
full_text = chat_file.content.decode("utf-8", errors="replace")
else:
full_text = (
extract_file_text(
file=io.BytesIO(chat_file.content),
file_name=chat_file.filename or "",
break_on_unprocessable=False,
)
or ""
)
except ToolCallException:
raise
except Exception:
raise ToolCallException(
message=f"Failed to decode file {file_id}",

View File

@@ -458,6 +458,27 @@ def run_async_sync_no_cancel(coro: Awaitable[T]) -> T:
return future.result()
def run_multiple_in_background(
funcs: list[Callable[[], None]],
thread_name_prefix: str = "worker",
) -> ThreadPoolExecutor:
"""Submit multiple callables to a ``ThreadPoolExecutor`` with context propagation.
Copies the current ``contextvars`` context once and runs every callable
inside that copy, which is important for preserving tenant IDs and other
context-local state across threads.
Returns the executor so the caller can ``shutdown()`` when done.
"""
ctx = contextvars.copy_context()
executor = ThreadPoolExecutor(
max_workers=len(funcs), thread_name_prefix=thread_name_prefix
)
for func in funcs:
executor.submit(ctx.run, func)
return executor
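A usage sketch, assuming the caller owns the executor's lifetime and that sync_a / sync_b are hypothetical context-sensitive workers:

executor = run_multiple_in_background(
    [sync_a, sync_b],
    thread_name_prefix="index-sync",
)
try:
    do_other_work()  # hypothetical; both callables run in the copied context
finally:
    executor.shutdown(wait=True)  # blocks until both callables finish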
class TimeoutThread(threading.Thread, Generic[R]):
def __init__(
self, timeout: float, func: Callable[..., R], *args: Any, **kwargs: Any

View File

@@ -14,7 +14,7 @@ aiofiles==25.1.0
# unstructured-client
aiohappyeyeballs==2.6.1
# via aiohttp
aiohttp==3.13.3
aiohttp==3.13.4
# via
# aiobotocore
# discord-py
@@ -271,7 +271,7 @@ fastapi-users-db-sqlalchemy==7.0.0
# via onyx
fastavro==1.12.1
# via cohere
fastmcp==3.0.2
fastmcp==3.2.0
# via onyx
fastuuid==0.14.0
# via litellm
@@ -1102,6 +1102,8 @@ tzdata==2025.2
# tzlocal
tzlocal==5.3.1
# via dateparser
uncalled-for==0.2.0
# via fastmcp
unstructured==0.18.27
# via onyx
unstructured-client==0.42.6

View File

@@ -10,7 +10,7 @@ aiofiles==25.1.0
# via aioboto3
aiohappyeyeballs==2.6.1
# via aiohttp
aiohttp==3.13.3
aiohttp==3.13.4
# via
# aiobotocore
# discord-py

View File

@@ -10,7 +10,7 @@ aiofiles==25.1.0
# via aioboto3
aiohappyeyeballs==2.6.1
# via aiohttp
aiohttp==3.13.3
aiohttp==3.13.4
# via
# aiobotocore
# discord-py

View File

@@ -12,7 +12,7 @@ aiofiles==25.1.0
# via aioboto3
aiohappyeyeballs==2.6.1
# via aiohttp
aiohttp==3.13.3
aiohttp==3.13.4
# via
# aiobotocore
# discord-py

View File

@@ -5,6 +5,7 @@ import asyncio
import json
import logging
import sys
import time
from dataclasses import asdict
from dataclasses import dataclass
from pathlib import Path
@@ -27,6 +28,9 @@ INTERNAL_SEARCH_TOOL_NAME = "internal_search"
INTERNAL_SEARCH_IN_CODE_TOOL_ID = "SearchTool"
MAX_REQUEST_ATTEMPTS = 5
RETRIABLE_STATUS_CODES = {429, 500, 502, 503, 504}
QUESTION_TIMEOUT_SECONDS = 300
QUESTION_RETRY_PAUSE_SECONDS = 30
MAX_QUESTION_ATTEMPTS = 3
@dataclass(frozen=True)
@@ -109,6 +113,27 @@ def normalize_api_base(api_base: str) -> str:
return f"{normalized}/api"
def load_completed_question_ids(output_file: Path) -> set[str]:
if not output_file.exists():
return set()
completed_ids: set[str] = set()
with output_file.open("r", encoding="utf-8") as file:
for line in file:
stripped = line.strip()
if not stripped:
continue
try:
record = json.loads(stripped)
except json.JSONDecodeError:
continue
question_id = record.get("question_id")
if isinstance(question_id, str) and question_id:
completed_ids.add(question_id)
return completed_ids
def load_questions(questions_file: Path) -> list[QuestionRecord]:
if not questions_file.exists():
raise FileNotFoundError(f"Questions file not found: {questions_file}")
@@ -348,6 +373,7 @@ async def generate_answers(
api_base: str,
api_key: str,
parallelism: int,
skipped: int,
) -> None:
if parallelism < 1:
raise ValueError("`--parallelism` must be at least 1.")
@@ -382,58 +408,178 @@ async def generate_answers(
write_lock = asyncio.Lock()
completed = 0
successful = 0
stuck_count = 0
failed_questions: list[FailedQuestionRecord] = []
total = len(questions)
remaining_count = len(questions)
overall_total = remaining_count + skipped
question_durations: list[float] = []
run_start_time = time.monotonic()
def print_progress() -> None:
avg_time = (
sum(question_durations) / len(question_durations)
if question_durations
else 0.0
)
elapsed = time.monotonic() - run_start_time
eta = avg_time * (remaining_count - completed) / max(parallelism, 1)
done = skipped + completed
bar_width = 30
filled = (
int(bar_width * done / overall_total)
if overall_total
else bar_width
)
bar = "" * filled + "" * (bar_width - filled)
pct = (done / overall_total * 100) if overall_total else 100.0
parts = (
f"\r{bar} {pct:5.1f}% "
f"[{done}/{overall_total}] "
f"avg {avg_time:.1f}s/q "
f"elapsed {elapsed:.0f}s "
f"ETA {eta:.0f}s "
f"(ok:{successful} fail:{len(failed_questions)}"
)
if stuck_count:
parts += f" stuck:{stuck_count}"
if skipped:
parts += f" skip:{skipped}"
parts += ")"
sys.stderr.write(parts)
sys.stderr.flush()
print_progress()
async def process_question(question_record: QuestionRecord) -> None:
nonlocal completed
nonlocal successful
nonlocal stuck_count
try:
async with semaphore:
result = await submit_question(
session=session,
api_base=api_base,
headers=headers,
internal_search_tool_id=internal_search_tool_id,
question_record=question_record,
last_error: Exception | None = None
for attempt in range(1, MAX_QUESTION_ATTEMPTS + 1):
q_start = time.monotonic()
try:
async with semaphore:
result = await asyncio.wait_for(
submit_question(
session=session,
api_base=api_base,
headers=headers,
internal_search_tool_id=internal_search_tool_id,
question_record=question_record,
),
timeout=QUESTION_TIMEOUT_SECONDS,
)
except asyncio.TimeoutError:
async with progress_lock:
stuck_count += 1
logger.warning(
"Question %s timed out after %ss (attempt %s/%s, "
"total stuck: %s) — retrying in %ss",
question_record.question_id,
QUESTION_TIMEOUT_SECONDS,
attempt,
MAX_QUESTION_ATTEMPTS,
stuck_count,
QUESTION_RETRY_PAUSE_SECONDS,
)
print_progress()
last_error = TimeoutError(
f"Timed out after {QUESTION_TIMEOUT_SECONDS}s "
f"on attempt {attempt}/{MAX_QUESTION_ATTEMPTS}"
)
except Exception as exc:
await asyncio.sleep(QUESTION_RETRY_PAUSE_SECONDS)
continue
except Exception as exc:
duration = time.monotonic() - q_start
async with progress_lock:
completed += 1
question_durations.append(duration)
failed_questions.append(
FailedQuestionRecord(
question_id=question_record.question_id,
error=str(exc),
)
)
logger.exception(
"Failed question %s (%s/%s)",
question_record.question_id,
completed,
remaining_count,
)
print_progress()
return
duration = time.monotonic() - q_start
async with write_lock:
file.write(json.dumps(asdict(result), ensure_ascii=False))
file.write("\n")
file.flush()
async with progress_lock:
completed += 1
failed_questions.append(
FailedQuestionRecord(
question_id=question_record.question_id,
error=str(exc),
)
)
logger.exception(
"Failed question %s (%s/%s)",
question_record.question_id,
completed,
total,
)
successful += 1
question_durations.append(duration)
print_progress()
return
async with write_lock:
file.write(json.dumps(asdict(result), ensure_ascii=False))
file.write("\n")
file.flush()
# All attempts exhausted due to timeouts
async with progress_lock:
completed += 1
successful += 1
logger.info("Processed %s/%s questions", completed, total)
failed_questions.append(
FailedQuestionRecord(
question_id=question_record.question_id,
error=str(last_error),
)
)
logger.error(
"Question %s failed after %s timeout attempts (%s/%s)",
question_record.question_id,
MAX_QUESTION_ATTEMPTS,
completed,
remaining_count,
)
print_progress()
await asyncio.gather(
*(process_question(question_record) for question_record in questions)
)
# Final newline after progress bar
sys.stderr.write("\n")
sys.stderr.flush()
total_elapsed = time.monotonic() - run_start_time
avg_time = (
sum(question_durations) / len(question_durations)
if question_durations
else 0.0
)
stuck_suffix = f", {stuck_count} stuck timeouts" if stuck_count else ""
resume_suffix = (
f"{skipped} previously completed, "
f"{skipped + successful}/{overall_total} overall"
if skipped
else ""
)
logger.info(
"Done: %s/%s successful in %.1fs (avg %.1fs/question%s)%s",
successful,
remaining_count,
total_elapsed,
avg_time,
stuck_suffix,
resume_suffix,
)
if failed_questions:
logger.warning(
"Completed with %s failed questions and %s successful questions.",
"%s questions failed:",
len(failed_questions),
successful,
)
for failed_question in failed_questions:
logger.warning(
@@ -453,7 +599,30 @@ def main() -> None:
raise ValueError("`--max-questions` must be at least 1 when provided.")
questions = questions[: args.max_questions]
logger.info("Loaded %s questions from %s", len(questions), args.questions_file)
completed_ids = load_completed_question_ids(args.output_file)
logger.info(
"Found %s already-answered question IDs in %s",
len(completed_ids),
args.output_file,
)
total_before_filter = len(questions)
questions = [q for q in questions if q.question_id not in completed_ids]
skipped = total_before_filter - len(questions)
if skipped:
logger.info(
"Resuming: %s/%s already answered, %s remaining",
skipped,
total_before_filter,
len(questions),
)
else:
logger.info("Loaded %s questions from %s", len(questions), args.questions_file)
if not questions:
logger.info("All questions already answered. Nothing to do.")
return
logger.info("Writing answers to %s", args.output_file)
asyncio.run(
@@ -463,6 +632,7 @@ def main() -> None:
api_base=api_base,
api_key=args.api_key,
parallelism=args.parallelism,
skipped=skipped,
)
)

View File

@@ -1,10 +1,8 @@
from collections.abc import Iterable
from typing import Any
from unittest.mock import MagicMock
from unittest.mock import patch
from onyx.connectors.google_drive.connector import GoogleDriveConnector
from onyx.connectors.google_drive.file_retrieval import DriveFileFieldType
from onyx.connectors.google_drive.file_retrieval import has_link_only_permission
from onyx.connectors.google_drive.models import DriveRetrievalStage
from onyx.connectors.google_drive.models import RetrievedDriveFile
@@ -75,10 +73,8 @@ def test_connector_skips_link_only_files_when_enabled() -> None:
retrieved_file = _build_retrieved_file(
[{"type": "domain", "allowFileDiscovery": False}]
)
fetch_mock = MagicMock(return_value=iter([retrieved_file]))
with (
patch.object(connector, "_fetch_drive_items", fetch_mock),
patch(
"onyx.connectors.google_drive.connector.run_functions_tuples_in_parallel",
side_effect=_stub_run_functions,
@@ -93,21 +89,16 @@ def test_connector_skips_link_only_files_when_enabled() -> None:
convert_mock.return_value = "doc"
checkpoint = connector.build_dummy_checkpoint()
results = list(
connector._extract_docs_from_google_drive(
connector._convert_retrieved_files_to_documents(
drive_files_iter=iter([retrieved_file]),
checkpoint=checkpoint,
start=None,
end=None,
include_permissions=False,
)
)
assert results == []
convert_mock.assert_not_called()
fetch_mock.assert_called_once()
get_new_ancestors_mock.assert_called_once()
assert (
fetch_mock.call_args.kwargs["field_type"] == DriveFileFieldType.WITH_PERMISSIONS
)
def test_connector_processes_files_when_option_disabled() -> None:
@@ -115,10 +106,8 @@ def test_connector_processes_files_when_option_disabled() -> None:
retrieved_file = _build_retrieved_file(
[{"type": "domain", "allowFileDiscovery": False}]
)
fetch_mock = MagicMock(return_value=iter([retrieved_file]))
with (
patch.object(connector, "_fetch_drive_items", fetch_mock),
patch(
"onyx.connectors.google_drive.connector.run_functions_tuples_in_parallel",
side_effect=_stub_run_functions,
@@ -133,16 +122,13 @@ def test_connector_processes_files_when_option_disabled() -> None:
convert_mock.return_value = "doc"
checkpoint = connector.build_dummy_checkpoint()
results = list(
connector._extract_docs_from_google_drive(
connector._convert_retrieved_files_to_documents(
drive_files_iter=iter([retrieved_file]),
checkpoint=checkpoint,
start=None,
end=None,
include_permissions=False,
)
)
assert len(results) == 1
convert_mock.assert_called_once()
fetch_mock.assert_called_once()
get_new_ancestors_mock.assert_called_once()
assert fetch_mock.call_args.kwargs["field_type"] == DriveFileFieldType.STANDARD

View File

@@ -1175,7 +1175,7 @@ def test_code_interpreter_receives_chat_files(
file_descriptor: FileDescriptor = {
"id": user_file.file_id,
"type": ChatFileType.CSV,
"type": ChatFileType.TABULAR,
"name": "data.csv",
"user_file_id": str(user_file.id),
}

View File

@@ -1,3 +1,9 @@
import mimetypes
from typing import Any
import requests
from tests.integration.common_utils.constants import API_SERVER_URL
from tests.integration.common_utils.managers.chat import ChatSessionManager
from tests.integration.common_utils.managers.file import FileManager
from tests.integration.common_utils.managers.llm_provider import LLMProviderManager
@@ -79,3 +85,90 @@ def test_send_message_with_text_file_attachment(admin_user: DATestUser) -> None:
assert (
"third line" in response.full_message.lower()
), "Chat response should contain the contents of the file"
def _set_token_threshold(admin_user: DATestUser, threshold_k: int) -> None:
"""Set the file token count threshold via admin settings API."""
response = requests.put(
f"{API_SERVER_URL}/admin/settings",
json={"file_token_count_threshold_k": threshold_k},
headers=admin_user.headers,
)
response.raise_for_status()
def _upload_raw(
filename: str,
content: bytes,
user: DATestUser,
) -> dict[str, Any]:
"""Upload a file and return the full JSON response (user_files + rejected_files)."""
mime_type, _ = mimetypes.guess_type(filename)
headers = user.headers.copy()
headers.pop("Content-Type", None)
response = requests.post(
f"{API_SERVER_URL}/user/projects/file/upload",
files=[("files", (filename, content, mime_type or "application/octet-stream"))],
headers=headers,
)
response.raise_for_status()
return response.json()
def test_csv_over_token_threshold_uploaded_not_indexed(
admin_user: DATestUser,
) -> None:
"""CSV exceeding token threshold is uploaded (accepted) but skips indexing."""
_set_token_threshold(admin_user, threshold_k=1)
try:
# ~2000 tokens with default tokenizer, well over 1K threshold
content = ("x " * 100 + "\n") * 20
result = _upload_raw("large.csv", content.encode(), admin_user)
assert len(result["user_files"]) == 1, "CSV should be accepted"
assert len(result["rejected_files"]) == 0, "CSV should not be rejected"
assert (
result["user_files"][0]["status"] == "SKIPPED"
), "CSV over threshold should be SKIPPED (uploaded but not indexed)"
assert (
result["user_files"][0]["chunk_count"] is None
), "Skipped file should have no chunks"
finally:
_set_token_threshold(admin_user, threshold_k=200)
def test_csv_under_token_threshold_uploaded_and_indexed(
admin_user: DATestUser,
) -> None:
"""CSV under token threshold is uploaded and queued for indexing."""
_set_token_threshold(admin_user, threshold_k=200)
try:
content = "col1,col2\na,b\n"
result = _upload_raw("small.csv", content.encode(), admin_user)
assert len(result["user_files"]) == 1, "CSV should be accepted"
assert len(result["rejected_files"]) == 0, "CSV should not be rejected"
assert (
result["user_files"][0]["status"] == "PROCESSING"
), "CSV under threshold should be PROCESSING (queued for indexing)"
finally:
_set_token_threshold(admin_user, threshold_k=200)
def test_txt_over_token_threshold_rejected(
admin_user: DATestUser,
) -> None:
"""Non-exempt file exceeding token threshold is rejected entirely."""
_set_token_threshold(admin_user, threshold_k=1)
try:
# ~2000 tokens, well over 1K threshold. Unlike CSV, .txt is not
# exempt from the threshold so the file should be rejected.
content = ("x " * 100 + "\n") * 20
result = _upload_raw("big.txt", content.encode(), admin_user)
assert len(result["user_files"]) == 0, "File should not be accepted"
assert len(result["rejected_files"]) == 1, "File should be rejected"
assert "token limit" in result["rejected_files"][0]["reason"].lower()
finally:
_set_token_threshold(admin_user, threshold_k=200)

View File

@@ -9,11 +9,11 @@ import httpx
import pytest
from pydantic import BaseModel
from ee.onyx.hooks.executor import _execute_hook_impl as execute_hook
from onyx.db.enums import HookFailStrategy
from onyx.db.enums import HookPoint
from onyx.error_handling.error_codes import OnyxErrorCode
from onyx.error_handling.exceptions import OnyxError
from onyx.hooks.executor import execute_hook
from onyx.hooks.executor import HookSkipped
from onyx.hooks.executor import HookSoftFailed
from onyx.hooks.points.query_processing import QueryProcessingResponse
@@ -118,28 +118,30 @@ def db_session() -> MagicMock:
@pytest.mark.parametrize(
"hooks_available,hook",
"multi_tenant,hook",
[
# HOOKS_AVAILABLE=False exits before the DB lookup — hook is irrelevant.
pytest.param(False, None, id="hooks_not_available"),
pytest.param(True, None, id="hook_not_found"),
pytest.param(True, _make_hook(is_active=False), id="hook_inactive"),
pytest.param(True, _make_hook(endpoint_url=None), id="no_endpoint_url"),
# MULTI_TENANT=True exits before the DB lookup — hook is irrelevant.
pytest.param(True, None, id="multi_tenant"),
pytest.param(False, None, id="hook_not_found"),
pytest.param(False, _make_hook(is_active=False), id="hook_inactive"),
pytest.param(False, _make_hook(endpoint_url=None), id="no_endpoint_url"),
],
)
def test_early_exit_returns_skipped_with_no_db_writes(
db_session: MagicMock,
-    hooks_available: bool,
+    multi_tenant: bool,
hook: MagicMock | None,
) -> None:
with (
patch("onyx.hooks.executor.HOOKS_AVAILABLE", hooks_available),
patch("ee.onyx.hooks.executor.MULTI_TENANT", multi_tenant),
patch(
"onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
"ee.onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
return_value=hook,
),
patch("onyx.hooks.executor.update_hook__no_commit") as mock_update,
patch("onyx.hooks.executor.create_hook_execution_log__no_commit") as mock_log,
patch("ee.onyx.hooks.executor.update_hook__no_commit") as mock_update,
patch(
"ee.onyx.hooks.executor.create_hook_execution_log__no_commit"
) as mock_log,
):
result = execute_hook(
db_session=db_session,
@@ -164,14 +166,16 @@ def test_success_returns_validated_model_and_sets_reachable(
hook = _make_hook()
with (
patch("onyx.hooks.executor.HOOKS_AVAILABLE", True),
patch("ee.onyx.hooks.executor.MULTI_TENANT", False),
patch(
"onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
"ee.onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
return_value=hook,
),
patch("onyx.hooks.executor.get_session_with_current_tenant"),
patch("onyx.hooks.executor.update_hook__no_commit") as mock_update,
patch("onyx.hooks.executor.create_hook_execution_log__no_commit") as mock_log,
patch("ee.onyx.hooks.executor.get_session_with_current_tenant"),
patch("ee.onyx.hooks.executor.update_hook__no_commit") as mock_update,
patch(
"ee.onyx.hooks.executor.create_hook_execution_log__no_commit"
) as mock_log,
patch("httpx.Client") as mock_client_cls,
):
_setup_client(mock_client_cls, response=_make_response())
@@ -195,14 +199,14 @@ def test_success_skips_reachable_write_when_already_true(db_session: MagicMock)
hook = _make_hook(is_reachable=True)
with (
patch("onyx.hooks.executor.HOOKS_AVAILABLE", True),
patch("ee.onyx.hooks.executor.MULTI_TENANT", False),
patch(
"onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
"ee.onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
return_value=hook,
),
patch("onyx.hooks.executor.get_session_with_current_tenant"),
patch("onyx.hooks.executor.update_hook__no_commit") as mock_update,
patch("onyx.hooks.executor.create_hook_execution_log__no_commit"),
patch("ee.onyx.hooks.executor.get_session_with_current_tenant"),
patch("ee.onyx.hooks.executor.update_hook__no_commit") as mock_update,
patch("ee.onyx.hooks.executor.create_hook_execution_log__no_commit"),
patch("httpx.Client") as mock_client_cls,
):
_setup_client(mock_client_cls, response=_make_response())
@@ -224,14 +228,16 @@ def test_non_dict_json_response_is_a_failure(db_session: MagicMock) -> None:
hook = _make_hook(fail_strategy=HookFailStrategy.SOFT)
with (
patch("onyx.hooks.executor.HOOKS_AVAILABLE", True),
patch("ee.onyx.hooks.executor.MULTI_TENANT", False),
patch(
"onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
"ee.onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
return_value=hook,
),
patch("onyx.hooks.executor.get_session_with_current_tenant"),
patch("onyx.hooks.executor.update_hook__no_commit") as mock_update,
patch("onyx.hooks.executor.create_hook_execution_log__no_commit") as mock_log,
patch("ee.onyx.hooks.executor.get_session_with_current_tenant"),
patch("ee.onyx.hooks.executor.update_hook__no_commit") as mock_update,
patch(
"ee.onyx.hooks.executor.create_hook_execution_log__no_commit"
) as mock_log,
patch("httpx.Client") as mock_client_cls,
):
_setup_client(
@@ -258,14 +264,16 @@ def test_json_decode_failure_is_a_failure(db_session: MagicMock) -> None:
hook = _make_hook(fail_strategy=HookFailStrategy.SOFT)
with (
patch("onyx.hooks.executor.HOOKS_AVAILABLE", True),
patch("ee.onyx.hooks.executor.MULTI_TENANT", False),
patch(
"onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
"ee.onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
return_value=hook,
),
patch("onyx.hooks.executor.get_session_with_current_tenant"),
patch("onyx.hooks.executor.update_hook__no_commit") as mock_update,
patch("onyx.hooks.executor.create_hook_execution_log__no_commit") as mock_log,
patch("ee.onyx.hooks.executor.get_session_with_current_tenant"),
patch("ee.onyx.hooks.executor.update_hook__no_commit") as mock_update,
patch(
"ee.onyx.hooks.executor.create_hook_execution_log__no_commit"
) as mock_log,
patch("httpx.Client") as mock_client_cls,
):
_setup_client(
@@ -384,14 +392,14 @@ def test_http_failure_paths(
hook = _make_hook(fail_strategy=fail_strategy)
with (
patch("onyx.hooks.executor.HOOKS_AVAILABLE", True),
patch("ee.onyx.hooks.executor.MULTI_TENANT", False),
patch(
"onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
"ee.onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
return_value=hook,
),
patch("onyx.hooks.executor.get_session_with_current_tenant"),
patch("onyx.hooks.executor.update_hook__no_commit") as mock_update,
patch("onyx.hooks.executor.create_hook_execution_log__no_commit"),
patch("ee.onyx.hooks.executor.get_session_with_current_tenant"),
patch("ee.onyx.hooks.executor.update_hook__no_commit") as mock_update,
patch("ee.onyx.hooks.executor.create_hook_execution_log__no_commit"),
patch("httpx.Client") as mock_client_cls,
):
_setup_client(mock_client_cls, side_effect=exception)
@@ -443,14 +451,14 @@ def test_authorization_header(
hook = _make_hook(api_key=api_key)
with (
patch("onyx.hooks.executor.HOOKS_AVAILABLE", True),
patch("ee.onyx.hooks.executor.MULTI_TENANT", False),
patch(
"onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
"ee.onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
return_value=hook,
),
patch("onyx.hooks.executor.get_session_with_current_tenant"),
patch("onyx.hooks.executor.update_hook__no_commit"),
patch("onyx.hooks.executor.create_hook_execution_log__no_commit"),
patch("ee.onyx.hooks.executor.get_session_with_current_tenant"),
patch("ee.onyx.hooks.executor.update_hook__no_commit"),
patch("ee.onyx.hooks.executor.create_hook_execution_log__no_commit"),
patch("httpx.Client") as mock_client_cls,
):
mock_client = _setup_client(mock_client_cls, response=_make_response())
@@ -489,13 +497,13 @@ def test_persist_session_failure_is_swallowed(
hook = _make_hook(fail_strategy=HookFailStrategy.HARD)
with (
patch("onyx.hooks.executor.HOOKS_AVAILABLE", True),
patch("ee.onyx.hooks.executor.MULTI_TENANT", False),
patch(
"onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
"ee.onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
return_value=hook,
),
patch(
"onyx.hooks.executor.get_session_with_current_tenant",
"ee.onyx.hooks.executor.get_session_with_current_tenant",
side_effect=RuntimeError("DB unavailable"),
),
patch("httpx.Client") as mock_client_cls,
@@ -556,14 +564,16 @@ def test_response_validation_failure_respects_fail_strategy(
hook = _make_hook(fail_strategy=fail_strategy)
with (
patch("onyx.hooks.executor.HOOKS_AVAILABLE", True),
patch("ee.onyx.hooks.executor.MULTI_TENANT", False),
patch(
"onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
"ee.onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
return_value=hook,
),
patch("onyx.hooks.executor.get_session_with_current_tenant"),
patch("onyx.hooks.executor.update_hook__no_commit") as mock_update,
patch("onyx.hooks.executor.create_hook_execution_log__no_commit") as mock_log,
patch("ee.onyx.hooks.executor.get_session_with_current_tenant"),
patch("ee.onyx.hooks.executor.update_hook__no_commit") as mock_update,
patch(
"ee.onyx.hooks.executor.create_hook_execution_log__no_commit"
) as mock_log,
patch("httpx.Client") as mock_client_cls,
):
# Response payload is missing required_field → ValidationError
@@ -619,13 +629,13 @@ def test_unexpected_exception_in_inner_respects_fail_strategy(
hook = _make_hook(fail_strategy=fail_strategy)
with (
patch("onyx.hooks.executor.HOOKS_AVAILABLE", True),
patch("ee.onyx.hooks.executor.MULTI_TENANT", False),
patch(
"onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
"ee.onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
return_value=hook,
),
patch(
"onyx.hooks.executor._execute_hook_inner",
"ee.onyx.hooks.executor._execute_hook_inner",
side_effect=ValueError("unexpected bug"),
),
):
@@ -658,17 +668,19 @@ def test_is_reachable_failure_does_not_prevent_log(db_session: MagicMock) -> Non
hook = _make_hook(fail_strategy=HookFailStrategy.SOFT)
with (
patch("onyx.hooks.executor.HOOKS_AVAILABLE", True),
patch("ee.onyx.hooks.executor.MULTI_TENANT", False),
patch(
"onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
"ee.onyx.hooks.executor.get_non_deleted_hook_by_hook_point",
return_value=hook,
),
patch("onyx.hooks.executor.get_session_with_current_tenant"),
patch("ee.onyx.hooks.executor.get_session_with_current_tenant"),
patch(
"onyx.hooks.executor.update_hook__no_commit",
"ee.onyx.hooks.executor.update_hook__no_commit",
side_effect=OnyxError(OnyxErrorCode.NOT_FOUND, "hook deleted"),
),
patch("onyx.hooks.executor.create_hook_execution_log__no_commit") as mock_log,
patch(
"ee.onyx.hooks.executor.create_hook_execution_log__no_commit"
) as mock_log,
patch("httpx.Client") as mock_client_cls,
):
_setup_client(mock_client_cls, side_effect=httpx.ConnectError("refused"))
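
Every hunk in this file makes the same two substitutions: the executor now lives under ee.onyx.hooks.executor, and the old HOOKS_AVAILABLE gate is replaced by a MULTI_TENANT gate (hooks are single-tenant only). A rough sketch of the early-exit order the parametrized cases exercise; the structure is inferred from the patch targets and test IDs, not copied from the implementation:

class HookSkippedSketch:  # stand-in for the real HookSkipped sentinel
    pass

def execute_hook_early_exit_sketch(multi_tenant, hook):
    """All skips happen before any update_hook/create_hook_execution_log write."""
    if multi_tenant:  # MULTI_TENANT=True exits before the DB lookup
        return HookSkippedSketch()
    if hook is None:  # id="hook_not_found"
        return HookSkippedSketch()
    if not hook.is_active or hook.endpoint_url is None:  # "hook_inactive" / "no_endpoint_url"
        return HookSkippedSketch()
    return None  # the real code would POST to hook.endpoint_url and persist from here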

View File

@@ -1,4 +1,4 @@
"""Unit tests for onyx.server.features.hooks.api helpers.
"""Unit tests for ee.onyx.server.features.hooks.api helpers.
Covers:
- _check_ssrf_safety: scheme enforcement and private-IP blocklist
@@ -16,13 +16,13 @@ from unittest.mock import patch
import httpx
import pytest
+from ee.onyx.server.features.hooks.api import _check_ssrf_safety
+from ee.onyx.server.features.hooks.api import _raise_for_validation_failure
+from ee.onyx.server.features.hooks.api import _validate_endpoint
from onyx.error_handling.error_codes import OnyxErrorCode
from onyx.error_handling.exceptions import OnyxError
from onyx.hooks.models import HookValidateResponse
from onyx.hooks.models import HookValidateStatus
-from onyx.server.features.hooks.api import _check_ssrf_safety
-from onyx.server.features.hooks.api import _raise_for_validation_failure
-from onyx.server.features.hooks.api import _validate_endpoint
# ---------------------------------------------------------------------------
# Helpers
@@ -117,28 +117,28 @@ class TestCheckSsrfSafety:
class TestValidateEndpoint:
def _call(self, *, api_key: str | None = _API_KEY) -> HookValidateResponse:
# Bypass SSRF check — tested separately in TestCheckSsrfSafety.
with patch("onyx.server.features.hooks.api._check_ssrf_safety"):
with patch("ee.onyx.server.features.hooks.api._check_ssrf_safety"):
return _validate_endpoint(
endpoint_url=_URL,
api_key=api_key,
timeout_seconds=_TIMEOUT,
)
@patch("onyx.server.features.hooks.api.httpx.Client")
@patch("ee.onyx.server.features.hooks.api.httpx.Client")
def test_2xx_returns_passed(self, mock_client_cls: MagicMock) -> None:
mock_client_cls.return_value.__enter__.return_value.post.return_value = (
_mock_response(200)
)
assert self._call().status == HookValidateStatus.passed
@patch("onyx.server.features.hooks.api.httpx.Client")
@patch("ee.onyx.server.features.hooks.api.httpx.Client")
def test_5xx_returns_passed(self, mock_client_cls: MagicMock) -> None:
mock_client_cls.return_value.__enter__.return_value.post.return_value = (
_mock_response(500)
)
assert self._call().status == HookValidateStatus.passed
@patch("onyx.server.features.hooks.api.httpx.Client")
@patch("ee.onyx.server.features.hooks.api.httpx.Client")
@pytest.mark.parametrize("status_code", [401, 403])
def test_401_403_returns_auth_failed(
self, mock_client_cls: MagicMock, status_code: int
@@ -150,21 +150,21 @@ class TestValidateEndpoint:
assert result.status == HookValidateStatus.auth_failed
assert str(status_code) in (result.error_message or "")
@patch("onyx.server.features.hooks.api.httpx.Client")
@patch("ee.onyx.server.features.hooks.api.httpx.Client")
def test_4xx_non_auth_returns_passed(self, mock_client_cls: MagicMock) -> None:
mock_client_cls.return_value.__enter__.return_value.post.return_value = (
_mock_response(422)
)
assert self._call().status == HookValidateStatus.passed
@patch("onyx.server.features.hooks.api.httpx.Client")
@patch("ee.onyx.server.features.hooks.api.httpx.Client")
def test_connect_timeout_returns_timeout(self, mock_client_cls: MagicMock) -> None:
mock_client_cls.return_value.__enter__.return_value.post.side_effect = (
httpx.ConnectTimeout("timed out")
)
assert self._call().status == HookValidateStatus.timeout
@patch("onyx.server.features.hooks.api.httpx.Client")
@patch("ee.onyx.server.features.hooks.api.httpx.Client")
@pytest.mark.parametrize(
"exc",
[
@@ -179,7 +179,7 @@ class TestValidateEndpoint:
mock_client_cls.return_value.__enter__.return_value.post.side_effect = exc
assert self._call().status == HookValidateStatus.timeout
@patch("onyx.server.features.hooks.api.httpx.Client")
@patch("ee.onyx.server.features.hooks.api.httpx.Client")
def test_connect_error_returns_cannot_connect(
self, mock_client_cls: MagicMock
) -> None:
@@ -189,7 +189,7 @@ class TestValidateEndpoint:
)
assert self._call().status == HookValidateStatus.cannot_connect
@patch("onyx.server.features.hooks.api.httpx.Client")
@patch("ee.onyx.server.features.hooks.api.httpx.Client")
def test_arbitrary_exception_returns_cannot_connect(
self, mock_client_cls: MagicMock
) -> None:
@@ -198,7 +198,7 @@ class TestValidateEndpoint:
)
assert self._call().status == HookValidateStatus.cannot_connect
@patch("onyx.server.features.hooks.api.httpx.Client")
@patch("ee.onyx.server.features.hooks.api.httpx.Client")
def test_api_key_sent_as_bearer(self, mock_client_cls: MagicMock) -> None:
mock_post = mock_client_cls.return_value.__enter__.return_value.post
mock_post.return_value = _mock_response(200)
@@ -206,7 +206,7 @@ class TestValidateEndpoint:
_, kwargs = mock_post.call_args
assert kwargs["headers"]["Authorization"] == "Bearer mykey"
@patch("onyx.server.features.hooks.api.httpx.Client")
@patch("ee.onyx.server.features.hooks.api.httpx.Client")
def test_no_api_key_omits_auth_header(self, mock_client_cls: MagicMock) -> None:
mock_post = mock_client_cls.return_value.__enter__.return_value.post
mock_post.return_value = _mock_response(200)
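
Collectively, TestValidateEndpoint pins a four-way status map: 401/403 mean auth_failed, timeout-flavored exceptions mean timeout, connection or unexpected errors mean cannot_connect, and every other HTTP status (2xx, non-auth 4xx, 5xx) counts as passed because the endpoint proved reachable. A condensed sketch of that mapping; the exact exception tuple is an assumption (the parametrized list above is truncated), and the real _validate_endpoint also builds the request and error messages:

import httpx

def classify_validation_sketch(do_post) -> str:
    """do_post() performs the POST; the return value names a HookValidateStatus."""
    try:
        resp = do_post()
    except (httpx.ConnectTimeout, httpx.ReadTimeout, httpx.WriteTimeout, httpx.PoolTimeout):
        return "timeout"
    except Exception:
        return "cannot_connect"  # httpx.ConnectError and arbitrary exceptions alike
    if resp.status_code in (401, 403):
        return "auth_failed"  # error_message carries the status code
    return "passed"  # 2xx, non-auth 4xx, and 5xx all prove reachability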

View File

@@ -300,6 +300,66 @@ class TestExtractContextFiles:
assert result.file_texts == []
assert result.total_token_count == 50
@patch("onyx.chat.process_message.load_in_memory_chat_files")
def test_tool_metadata_file_id_matches_chat_history_file_id(
self, mock_load: MagicMock
) -> None:
"""The file_id in tool metadata (from extract_context_files) and the
file_id in chat history messages (from build_file_context) must
agree, otherwise the LLM sees different IDs for the same file across
turns.
In production, UserFile.id (UUID PK) differs from UserFile.file_id
(file-store path). Both pathways should produce the same file_id
(UserFile.id) for FileReaderTool."""
from onyx.chat.chat_utils import build_file_context
user_file_uuid = uuid4()
file_store_path = f"user_files/{user_file_uuid}/data.csv"
uf = UserFile(
id=user_file_uuid,
file_id=file_store_path,
name="data.csv",
token_count=100,
file_type="text/csv",
)
in_memory = InMemoryChatFile(
file_id=file_store_path,
content=b"col1,col2\na,b",
file_type=ChatFileType.TABULAR,
filename="data.csv",
)
mock_load.return_value = [in_memory]
# Pathway 1: extract_context_files (project/persona context)
result = extract_context_files(
user_files=[uf],
llm_max_context_window=10000,
reserved_token_count=0,
db_session=MagicMock(),
)
assert len(result.file_metadata_for_tool) == 1
tool_metadata_file_id = result.file_metadata_for_tool[0].file_id
# Pathway 2: build_file_context (chat history path)
# In convert_chat_history, tool_file_id comes from
# file_descriptor["user_file_id"], which is str(UserFile.id)
ctx = build_file_context(
tool_file_id=str(user_file_uuid),
filename="data.csv",
file_type=ChatFileType.TABULAR,
)
chat_history_file_id = ctx.tool_metadata.file_id
# Both pathways must produce the same ID for the LLM
assert tool_metadata_file_id == chat_history_file_id, (
f"File ID mismatch: extract_context_files uses '{tool_metadata_file_id}' "
f"but build_file_context uses '{chat_history_file_id}'."
)
@patch("onyx.chat.process_message.DISABLE_VECTOR_DB", True)
def test_overflow_with_vector_db_disabled_provides_tool_metadata(self) -> None:
"""When vector DB is disabled, overflow produces FileToolMetadata."""
@@ -316,6 +376,128 @@ class TestExtractContextFiles:
assert len(result.file_metadata_for_tool) == 1
assert result.file_metadata_for_tool[0].filename == "bigfile.txt"
@patch("onyx.chat.process_message.load_in_memory_chat_files")
def test_metadata_only_files_not_counted_in_aggregate_tokens(
self, mock_load: MagicMock
) -> None:
"""Metadata-only files (TABULAR) should not count toward the token budget."""
text_file_id = str(uuid4())
text_uf = _make_user_file(token_count=100, file_id=text_file_id)
# TABULAR file with large token count — should be excluded from aggregate
tabular_uf = _make_user_file(
token_count=50000, name="huge.xlsx", file_id=str(uuid4())
)
tabular_uf.file_type = (
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
)
mock_load.return_value = [
_make_in_memory_file(file_id=text_file_id, content="text content"),
InMemoryChatFile(
file_id=str(tabular_uf.id),
content=b"binary xlsx",
file_type=ChatFileType.TABULAR,
filename="huge.xlsx",
),
]
result = extract_context_files(
user_files=[text_uf, tabular_uf],
llm_max_context_window=10000,
reserved_token_count=0,
db_session=MagicMock(),
)
# Text file fits (100 < 6000), so files should be loaded
assert result.file_texts == ["text content"]
# TABULAR file should appear as tool metadata, not in file_texts
assert len(result.file_metadata_for_tool) == 1
assert result.file_metadata_for_tool[0].filename == "huge.xlsx"
@patch("onyx.chat.process_message.load_in_memory_chat_files")
def test_metadata_only_files_loaded_as_tool_metadata(
self, mock_load: MagicMock
) -> None:
"""When files fit, metadata-only files appear in file_metadata_for_tool."""
text_file_id = str(uuid4())
tabular_file_id = str(uuid4())
text_uf = _make_user_file(token_count=100, file_id=text_file_id)
tabular_uf = _make_user_file(
token_count=500, name="data.csv", file_id=tabular_file_id
)
tabular_uf.file_type = "text/csv"
mock_load.return_value = [
_make_in_memory_file(file_id=text_file_id, content="hello"),
InMemoryChatFile(
file_id=tabular_file_id,
content=b"col1,col2\na,b",
file_type=ChatFileType.TABULAR,
filename="data.csv",
),
]
result = extract_context_files(
user_files=[text_uf, tabular_uf],
llm_max_context_window=10000,
reserved_token_count=0,
db_session=MagicMock(),
)
assert result.file_texts == ["hello"]
assert len(result.file_metadata_for_tool) == 1
assert result.file_metadata_for_tool[0].filename == "data.csv"
# TABULAR should not appear in file_metadata (that's for citation)
assert all(m.filename != "data.csv" for m in result.file_metadata)
def test_overflow_with_vector_db_preserves_metadata_only_tool_metadata(
self,
) -> None:
"""When text files overflow with vector DB enabled, metadata-only files
should still be exposed via file_metadata_for_tool since they aren't
in the vector DB and would otherwise be inaccessible."""
text_uf = _make_user_file(token_count=7000, name="bigfile.txt")
tabular_uf = _make_user_file(token_count=500, name="data.xlsx")
tabular_uf.file_type = (
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
)
result = extract_context_files(
user_files=[text_uf, tabular_uf],
llm_max_context_window=10000,
reserved_token_count=0,
db_session=MagicMock(),
)
# Text files overflow → search filter enabled
assert result.use_as_search_filter is True
assert result.file_texts == []
# TABULAR file should still be in tool metadata
assert len(result.file_metadata_for_tool) == 1
assert result.file_metadata_for_tool[0].filename == "data.xlsx"
@patch("onyx.chat.process_message.DISABLE_VECTOR_DB", True)
def test_overflow_no_vector_db_includes_all_files_in_tool_metadata(self) -> None:
"""When vector DB is disabled and files overflow, all files
(both text and metadata-only) appear in file_metadata_for_tool."""
text_uf = _make_user_file(token_count=7000, name="bigfile.txt")
tabular_uf = _make_user_file(token_count=500, name="data.xlsx")
tabular_uf.file_type = (
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
)
result = extract_context_files(
user_files=[text_uf, tabular_uf],
llm_max_context_window=10000,
reserved_token_count=0,
db_session=MagicMock(),
)
assert result.use_as_search_filter is False
assert len(result.file_metadata_for_tool) == 2
filenames = {m.filename for m in result.file_metadata_for_tool}
assert filenames == {"bigfile.txt", "data.xlsx"}
# ===========================================================================
# Search filter + search_usage determination

View File

@@ -644,6 +644,92 @@ class TestConstructMessageHistory:
assert "Project file 0 content" in project_message.message
assert "Project file 1 content" in project_message.message
def test_file_metadata_for_tool_produces_message(self) -> None:
"""When context_files has file_metadata_for_tool, a metadata listing
message should be injected into the history."""
system_prompt = create_message("System", MessageType.SYSTEM, 10)
user_msg = create_message("Analyze the spreadsheet", MessageType.USER, 5)
context_files = ExtractedContextFiles(
file_texts=[],
image_files=[],
use_as_search_filter=False,
total_token_count=0,
file_metadata=[],
uncapped_token_count=0,
file_metadata_for_tool=[
FileToolMetadata(
file_id="xlsx-1",
filename="report.xlsx",
approx_char_count=100000,
),
],
)
result = construct_message_history(
system_prompt=system_prompt,
custom_agent_prompt=None,
simple_chat_history=[user_msg],
reminder_message=None,
context_files=context_files,
available_tokens=1000,
token_counter=_simple_token_counter,
)
# Should have: system, tool_metadata_message, user
assert len(result) == 3
metadata_msg = result[1]
assert metadata_msg.message_type == MessageType.USER
assert "report.xlsx" in metadata_msg.message
assert "xlsx-1" in metadata_msg.message
def test_metadata_only_and_text_files_both_present(self) -> None:
"""When both text content and tool metadata are present, both messages
should appear in the history."""
system_prompt = create_message("System", MessageType.SYSTEM, 10)
user_msg = create_message("Summarize everything", MessageType.USER, 5)
context_files = ExtractedContextFiles(
file_texts=["Text file content here"],
image_files=[],
use_as_search_filter=False,
total_token_count=100,
file_metadata=[
ContextFileMetadata(
file_id="txt-1",
filename="notes.txt",
file_content="Text file content here",
),
],
uncapped_token_count=100,
file_metadata_for_tool=[
FileToolMetadata(
file_id="xlsx-1",
filename="data.xlsx",
approx_char_count=50000,
),
],
)
result = construct_message_history(
system_prompt=system_prompt,
custom_agent_prompt=None,
simple_chat_history=[user_msg],
reminder_message=None,
context_files=context_files,
available_tokens=2000,
token_counter=_simple_token_counter,
)
# Should have: system, context_files_message, tool_metadata_message, user
assert len(result) == 4
# Context files message (text content)
assert "documents" in result[1].message
assert "Text file content here" in result[1].message
# Tool metadata message
assert "data.xlsx" in result[2].message
assert result[3] == user_msg
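
The two cases above fix the injection order in construct_message_history: system prompt, then the file-content message (only when file_texts is non-empty), then the tool-metadata listing (only when file_metadata_for_tool is non-empty), then the chat history. A sketch of that assembly under assumed names:

def assemble_history_sketch(system, context_msg, tool_meta_msg, history):
    """Returns messages in the order the assertions above require."""
    msgs = [system]
    if context_msg is not None:  # present only when file_texts has content
        msgs.append(context_msg)
    if tool_meta_msg is not None:  # present only when tool metadata exists
        msgs.append(tool_meta_msg)
    return msgs + list(history)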
def _simple_token_counter(text: str) -> int:
"""Approximate token counter for tests (~4 chars per token)."""

View File

@@ -556,14 +556,13 @@ class TestRunModels:
mock_handle.assert_not_called()
def test_http_disconnect_completion_via_generator_exit(self) -> None:
"""GeneratorExit from HTTP disconnect triggers worker self-completion.
"""GeneratorExit from HTTP disconnect triggers main-thread completion.
When the HTTP client closes the connection, Starlette throws GeneratorExit
into the stream generator. The finally block sets drain_done (signalling
-        emitters to stop blocking) and calls executor.shutdown(wait=False) so the
-        server thread is never blocked. Worker threads detect drain_done.is_set()
-        after run_llm_loop completes and self-persist the result via
-        llm_loop_completion_handle using their own DB session.
+        emitters to stop blocking), waits for workers via executor.shutdown(wait=True),
+        then calls llm_loop_completion_handle for each successful model from the main
+        thread.
This is the primary regression for test_send_message_disconnect_and_cleanup:
the integration test disconnects mid-stream and expects the DB message to be
@@ -571,24 +570,20 @@ class TestRunModels:
"""
import threading
-        # Signals the worker to unblock from run_llm_loop after gen.close() returns.
-        # This guarantees drain_done is set BEFORE the worker returns from run_llm_loop,
-        # so the self-completion path (drain_done.is_set() check) is always taken.
-        disconnect_received = threading.Event()
        # Set by the llm_loop_completion_handle mock when called.
        completion_called = threading.Event()
-        def emit_then_complete(**kwargs: Any) -> None:
+        def emit_then_block_until_drain(**kwargs: Any) -> None:
            """Emit one packet (to give the drain loop a yield point), then block
-            until the main thread signals that gen.close() has been called. This
-            ensures drain_done is set before we return so model_succeeded is checked
-            against a set drain_done — no race condition.
+            until drain_done is set — simulating a mid-stream LLM call that exits
+            promptly once the emitter signals shutdown.
            """
            emitter = kwargs["emitter"]
            emitter.emit(
                Packet(placement=Placement(turn_index=0), obj=ReasoningStart())
            )
-            disconnect_received.wait(timeout=5)
+            # Block until drain_done is set by gen.close(). The Emitter's _drain_done
+            # is the same Event that _run_models sets, so this unblocks promptly.
+            emitter._drain_done.wait(timeout=5)
setup = _make_setup(n_models=1)
# is_connected() always True — HTTP disconnect does NOT set the Redis stop fence.
@@ -597,7 +592,7 @@ class TestRunModels:
with (
patch(
"onyx.chat.process_message.run_llm_loop",
-                side_effect=emit_then_complete,
+                side_effect=emit_then_block_until_drain,
),
patch("onyx.chat.process_message.run_deep_research_llm_loop"),
patch("onyx.chat.process_message.construct_tools", return_value={}),
@@ -613,34 +608,25 @@ class TestRunModels:
):
from onyx.chat.process_message import _run_models
# cast to Generator so .close() is available; _run_models returns
# AnswerStream (= Iterator) but the actual object is always a generator.
gen = cast(Generator, _run_models(setup, MagicMock(), MagicMock()))
# Advance to the first yielded packet — generator suspends at `yield item`.
first = next(gen)
assert isinstance(first, Packet)
# Simulate Starlette closing the stream on HTTP client disconnect.
# GeneratorExit is thrown at the `yield item` suspension point.
+        # gen.close() → GeneratorExit → finally → drain_done.set() →
+        # executor.shutdown(wait=True) → main thread completes models.
        gen.close()
-        # Unblock the worker now that drain_done has been set by gen.close().
-        disconnect_received.set()
-        # Worker self-completes asynchronously (executor.shutdown(wait=False)).
-        # Wait here, inside the patch context, so that get_session_with_current_tenant
-        # and llm_loop_completion_handle mocks are still active when the worker calls them.
-        assert completion_called.wait(
-            timeout=5
-        ), "worker must self-complete via drain_done within 5 seconds"
        assert (
-            mock_handle.call_count == 1
-        ), "completion handle must be called once for the successful model"
+            completion_called.is_set()
+        ), "main thread must call completion for the successful model"
+        assert mock_handle.call_count == 1
def test_b1_race_disconnect_handler_completes_already_finished_model(self) -> None:
"""B1 regression: model finishes BEFORE GeneratorExit fires.
-        The worker exits _run_model with drain_done.is_set()=False and skips
-        self-completion. When gen.close() fires afterward, the finally else-branch
-        must detect model_succeeded=True and call llm_loop_completion_handle itself.
+        The worker exits _run_model before drain_done is set. When gen.close()
+        fires afterward, the finally block sets drain_done, waits for workers
+        (already done), then the main thread calls llm_loop_completion_handle.
Contrast with test_http_disconnect_completion_via_generator_exit, which
tests the opposite ordering (worker finishes AFTER disconnect).
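
The rewritten docstrings describe the new disconnect protocol end to end: the stream generator's finally block sets drain_done, joins the workers with executor.shutdown(wait=True), then calls llm_loop_completion_handle for each successful model from the main thread. A stripped-down generator showing just that ordering; the names follow the test and everything else is an assumption:

import threading
from concurrent.futures import ThreadPoolExecutor

def run_models_sketch(worker, n_models, on_complete):
    """worker(drain_done) should return promptly once drain_done is set."""
    drain_done = threading.Event()
    executor = ThreadPoolExecutor(max_workers=n_models)
    futures = [executor.submit(worker, drain_done) for _ in range(n_models)]
    try:
        yield "first packet"  # gen.close() throws GeneratorExit at this yield
    finally:
        drain_done.set()  # signal emitters to stop blocking
        executor.shutdown(wait=True)  # join workers on the main thread
        for fut in futures:  # every future is done after shutdown(wait=True)
            if fut.exception() is None:
                on_complete(fut.result())  # stands in for llm_loop_completion_handle

Driving this the way the test does (next(gen), then gen.close()) runs the finally block synchronously, which is why the new assertions can check completion_called.is_set() without waiting.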

View File

@@ -139,7 +139,7 @@ def test_csv_file_type() -> None:
result = _extract_referenced_file_descriptors([tool_call], message)
assert len(result) == 1
assert result[0]["type"] == ChatFileType.CSV
assert result[0]["type"] == ChatFileType.TABULAR
def test_unknown_extension_defaults_to_plain_text() -> None:

View File

@@ -1,15 +1,23 @@
"""Tests for Canvas connector — client (PR1)."""
"""Tests for Canvas connector — client, credentials, conversion."""
from datetime import datetime
from datetime import timezone
from typing import Any
from unittest.mock import MagicMock
from unittest.mock import patch
import pytest
from onyx.configs.constants import DocumentSource
from onyx.connectors.canvas.client import CanvasApiClient
from onyx.connectors.canvas.connector import CanvasConnector
from onyx.connectors.exceptions import ConnectorValidationError
from onyx.connectors.exceptions import CredentialExpiredError
from onyx.connectors.exceptions import InsufficientPermissionsError
from onyx.connectors.exceptions import UnexpectedValidationError
from onyx.connectors.models import ConnectorMissingCredentialError
from onyx.error_handling.exceptions import OnyxError
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
@@ -18,6 +26,77 @@ FAKE_BASE_URL = "https://myschool.instructure.com"
FAKE_TOKEN = "fake-canvas-token"
def _mock_course(
course_id: int = 1,
name: str = "Intro to CS",
course_code: str = "CS101",
) -> dict[str, Any]:
return {
"id": course_id,
"name": name,
"course_code": course_code,
"created_at": "2025-01-01T00:00:00Z",
"workflow_state": "available",
}
def _build_connector(base_url: str = FAKE_BASE_URL) -> CanvasConnector:
"""Build a connector with mocked credential validation."""
with patch("onyx.connectors.canvas.client.rl_requests") as mock_req:
mock_req.get.return_value = _mock_response(json_data=[_mock_course()])
connector = CanvasConnector(canvas_base_url=base_url)
connector.load_credentials({"canvas_access_token": FAKE_TOKEN})
return connector
def _mock_page(
page_id: int = 10,
title: str = "Syllabus",
updated_at: str = "2025-06-01T12:00:00Z",
) -> dict[str, Any]:
return {
"page_id": page_id,
"url": "syllabus",
"title": title,
"body": "<p>Welcome to the course</p>",
"created_at": "2025-01-15T00:00:00Z",
"updated_at": updated_at,
}
def _mock_assignment(
assignment_id: int = 20,
name: str = "Homework 1",
course_id: int = 1,
updated_at: str = "2025-06-01T12:00:00Z",
) -> dict[str, Any]:
return {
"id": assignment_id,
"name": name,
"description": "<p>Solve these problems</p>",
"html_url": f"{FAKE_BASE_URL}/courses/{course_id}/assignments/{assignment_id}",
"course_id": course_id,
"created_at": "2025-01-20T00:00:00Z",
"updated_at": updated_at,
"due_at": "2025-02-01T23:59:00Z",
}
def _mock_announcement(
announcement_id: int = 30,
title: str = "Class Cancelled",
course_id: int = 1,
posted_at: str = "2025-06-01T12:00:00Z",
) -> dict[str, Any]:
return {
"id": announcement_id,
"title": title,
"message": "<p>No class today</p>",
"html_url": f"{FAKE_BASE_URL}/courses/{course_id}/discussion_topics/{announcement_id}",
"posted_at": posted_at,
}
def _mock_response(
status_code: int = 200,
json_data: Any = None,
@@ -325,6 +404,57 @@ class TestGet:
assert result == expected
# ---------------------------------------------------------------------------
# CanvasApiClient.paginate tests
# ---------------------------------------------------------------------------
class TestPaginate:
@patch("onyx.connectors.canvas.client.rl_requests")
def test_single_page(self, mock_requests: MagicMock) -> None:
mock_requests.get.return_value = _mock_response(
json_data=[{"id": 1}, {"id": 2}]
)
client = CanvasApiClient(
bearer_token=FAKE_TOKEN,
canvas_base_url=FAKE_BASE_URL,
)
pages = list(client.paginate("courses"))
assert len(pages) == 1
assert pages[0] == [{"id": 1}, {"id": 2}]
@patch("onyx.connectors.canvas.client.rl_requests")
def test_two_pages(self, mock_requests: MagicMock) -> None:
next_link = f'<{FAKE_BASE_URL}/api/v1/courses?page=2>; rel="next"'
page1 = _mock_response(json_data=[{"id": 1}], link_header=next_link)
page2 = _mock_response(json_data=[{"id": 2}])
mock_requests.get.side_effect = [page1, page2]
client = CanvasApiClient(
bearer_token=FAKE_TOKEN,
canvas_base_url=FAKE_BASE_URL,
)
pages = list(client.paginate("courses"))
assert len(pages) == 2
assert pages[0] == [{"id": 1}]
assert pages[1] == [{"id": 2}]
@patch("onyx.connectors.canvas.client.rl_requests")
def test_empty_response(self, mock_requests: MagicMock) -> None:
mock_requests.get.return_value = _mock_response(json_data=[])
client = CanvasApiClient(
bearer_token=FAKE_TOKEN,
canvas_base_url=FAKE_BASE_URL,
)
pages = list(client.paginate("courses"))
assert pages == []
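
TestPaginate encodes the Canvas pagination contract: yield each JSON page, follow the rel="next" Link header until it is absent, and yield nothing for an empty first page. A sketch of that loop with the HTTP layer reduced to a callable (get_page is an assumed shim, not the client's API):

def paginate_sketch(get_page, first_url):
    """get_page(url) -> (items, next_url_or_None); yields non-empty pages only."""
    url = first_url
    while url is not None:
        items, next_url = get_page(url)
        if not items:
            return  # empty page: stop without yielding, matching test_empty_response
        yield items
        url = next_url  # parsed from the rel="next" Link header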
# ---------------------------------------------------------------------------
# CanvasApiClient._parse_next_link tests
# ---------------------------------------------------------------------------
@@ -379,3 +509,368 @@ class TestParseNextLink:
with pytest.raises(OnyxError, match="must use https"):
self.client._parse_next_link(header)
# ---------------------------------------------------------------------------
# CanvasConnector — credential loading
# ---------------------------------------------------------------------------
class TestLoadCredentials:
def _assert_load_credentials_raises(
self,
status_code: int,
expected_error: type[Exception],
mock_requests: MagicMock,
) -> None:
"""Helper: assert load_credentials raises expected_error for a given status."""
mock_requests.get.return_value = _mock_response(status_code, {})
connector = CanvasConnector(canvas_base_url=FAKE_BASE_URL)
with pytest.raises(expected_error):
connector.load_credentials({"canvas_access_token": FAKE_TOKEN})
@patch("onyx.connectors.canvas.client.rl_requests")
def test_load_credentials_success(self, mock_requests: MagicMock) -> None:
mock_requests.get.return_value = _mock_response(json_data=[_mock_course()])
connector = CanvasConnector(canvas_base_url=FAKE_BASE_URL)
result = connector.load_credentials({"canvas_access_token": FAKE_TOKEN})
assert result is None
assert connector._canvas_client is not None
def test_canvas_client_raises_without_credentials(self) -> None:
connector = CanvasConnector(canvas_base_url=FAKE_BASE_URL)
with pytest.raises(ConnectorMissingCredentialError):
_ = connector.canvas_client
@patch("onyx.connectors.canvas.client.rl_requests")
def test_load_credentials_invalid_token(self, mock_requests: MagicMock) -> None:
self._assert_load_credentials_raises(401, CredentialExpiredError, mock_requests)
@patch("onyx.connectors.canvas.client.rl_requests")
def test_load_credentials_insufficient_permissions(
self, mock_requests: MagicMock
) -> None:
self._assert_load_credentials_raises(
403, InsufficientPermissionsError, mock_requests
)
# ---------------------------------------------------------------------------
# CanvasConnector — URL normalization
# ---------------------------------------------------------------------------
class TestConnectorUrlNormalization:
def test_strips_api_v1_suffix(self) -> None:
connector = _build_connector(base_url=f"{FAKE_BASE_URL}/api/v1")
result = connector.canvas_base_url
expected = FAKE_BASE_URL
assert result == expected
def test_strips_trailing_slash(self) -> None:
connector = _build_connector(base_url=f"{FAKE_BASE_URL}/")
result = connector.canvas_base_url
expected = FAKE_BASE_URL
assert result == expected
def test_no_change_for_clean_url(self) -> None:
connector = _build_connector(base_url=FAKE_BASE_URL)
result = connector.canvas_base_url
expected = FAKE_BASE_URL
assert result == expected
# ---------------------------------------------------------------------------
# CanvasConnector — document conversion
# ---------------------------------------------------------------------------
class TestDocumentConversion:
def setup_method(self) -> None:
self.connector = _build_connector()
def test_convert_page_to_document(self) -> None:
from onyx.connectors.canvas.connector import CanvasPage
page = CanvasPage(
page_id=10,
url="syllabus",
title="Syllabus",
body="<p>Welcome</p>",
created_at="2025-01-15T00:00:00Z",
updated_at="2025-06-01T12:00:00Z",
course_id=1,
)
doc = self.connector._convert_page_to_document(page)
expected_id = "canvas-page-1-10"
expected_metadata = {"course_id": "1", "type": "page"}
expected_updated_at = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
assert doc.id == expected_id
assert doc.source == DocumentSource.CANVAS
assert doc.semantic_identifier == "Syllabus"
assert doc.metadata == expected_metadata
assert doc.sections[0].link is not None
assert f"{FAKE_BASE_URL}/courses/1/pages/syllabus" in doc.sections[0].link
assert doc.doc_updated_at == expected_updated_at
def test_convert_page_without_body(self) -> None:
from onyx.connectors.canvas.connector import CanvasPage
page = CanvasPage(
page_id=11,
url="empty-page",
title="Empty Page",
body=None,
created_at="2025-01-15T00:00:00Z",
updated_at="2025-06-01T12:00:00Z",
course_id=1,
)
doc = self.connector._convert_page_to_document(page)
section_text = doc.sections[0].text
assert section_text is not None
assert "Empty Page" in section_text
assert "<p>" not in section_text
def test_convert_assignment_to_document(self) -> None:
from onyx.connectors.canvas.connector import CanvasAssignment
assignment = CanvasAssignment(
id=20,
name="Homework 1",
description="<p>Solve these</p>",
html_url=f"{FAKE_BASE_URL}/courses/1/assignments/20",
course_id=1,
created_at="2025-01-20T00:00:00Z",
updated_at="2025-06-01T12:00:00Z",
due_at="2025-02-01T23:59:00Z",
)
doc = self.connector._convert_assignment_to_document(assignment)
expected_id = "canvas-assignment-1-20"
expected_due_text = "Due: February 01, 2025 23:59 UTC"
assert doc.id == expected_id
assert doc.source == DocumentSource.CANVAS
assert doc.semantic_identifier == "Homework 1"
assert doc.sections[0].text is not None
assert expected_due_text in doc.sections[0].text
def test_convert_assignment_without_description(self) -> None:
from onyx.connectors.canvas.connector import CanvasAssignment
assignment = CanvasAssignment(
id=21,
name="Quiz 1",
description=None,
html_url=f"{FAKE_BASE_URL}/courses/1/assignments/21",
course_id=1,
created_at="2025-01-20T00:00:00Z",
updated_at="2025-06-01T12:00:00Z",
due_at=None,
)
doc = self.connector._convert_assignment_to_document(assignment)
section_text = doc.sections[0].text
assert section_text is not None
assert "Quiz 1" in section_text
assert "Due:" not in section_text
def test_convert_announcement_to_document(self) -> None:
from onyx.connectors.canvas.connector import CanvasAnnouncement
announcement = CanvasAnnouncement(
id=30,
title="Class Cancelled",
message="<p>No class today</p>",
html_url=f"{FAKE_BASE_URL}/courses/1/discussion_topics/30",
posted_at="2025-06-01T12:00:00Z",
course_id=1,
)
doc = self.connector._convert_announcement_to_document(announcement)
expected_id = "canvas-announcement-1-30"
expected_updated_at = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
assert doc.id == expected_id
assert doc.source == DocumentSource.CANVAS
assert doc.semantic_identifier == "Class Cancelled"
assert doc.doc_updated_at == expected_updated_at
def test_convert_announcement_without_posted_at(self) -> None:
from onyx.connectors.canvas.connector import CanvasAnnouncement
announcement = CanvasAnnouncement(
id=31,
title="TBD Announcement",
message=None,
html_url=f"{FAKE_BASE_URL}/courses/1/discussion_topics/31",
posted_at=None,
course_id=1,
)
doc = self.connector._convert_announcement_to_document(announcement)
assert doc.doc_updated_at is None
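
The conversion tests all assert the same document ID scheme, canvas-<kind>-<course_id>-<object_id>, which is worth stating once (the helper is illustrative, not the connector's API):

def canvas_doc_id(kind: str, course_id: int, object_id: int) -> str:
    """ID scheme asserted by TestDocumentConversion."""
    return f"canvas-{kind}-{course_id}-{object_id}"

assert canvas_doc_id("page", 1, 10) == "canvas-page-1-10"
assert canvas_doc_id("assignment", 1, 20) == "canvas-assignment-1-20"
assert canvas_doc_id("announcement", 1, 30) == "canvas-announcement-1-30"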
# ---------------------------------------------------------------------------
# CanvasConnector — validate_connector_settings
# ---------------------------------------------------------------------------
class TestValidateConnectorSettings:
def _assert_validate_raises(
self,
status_code: int,
expected_error: type[Exception],
mock_requests: MagicMock,
) -> None:
"""Helper: assert validate_connector_settings raises expected_error."""
success_resp = _mock_response(json_data=[_mock_course()])
fail_resp = _mock_response(status_code, {})
mock_requests.get.side_effect = [success_resp, fail_resp]
connector = CanvasConnector(canvas_base_url=FAKE_BASE_URL)
connector.load_credentials({"canvas_access_token": FAKE_TOKEN})
with pytest.raises(expected_error):
connector.validate_connector_settings()
@patch("onyx.connectors.canvas.client.rl_requests")
def test_validate_success(self, mock_requests: MagicMock) -> None:
mock_requests.get.return_value = _mock_response(json_data=[_mock_course()])
connector = _build_connector()
connector.validate_connector_settings() # should not raise
@patch("onyx.connectors.canvas.client.rl_requests")
def test_validate_expired_credential(self, mock_requests: MagicMock) -> None:
self._assert_validate_raises(401, CredentialExpiredError, mock_requests)
@patch("onyx.connectors.canvas.client.rl_requests")
def test_validate_insufficient_permissions(self, mock_requests: MagicMock) -> None:
self._assert_validate_raises(403, InsufficientPermissionsError, mock_requests)
@patch("onyx.connectors.canvas.client.rl_requests")
def test_validate_rate_limited(self, mock_requests: MagicMock) -> None:
self._assert_validate_raises(429, ConnectorValidationError, mock_requests)
@patch("onyx.connectors.canvas.client.rl_requests")
def test_validate_unexpected_error(self, mock_requests: MagicMock) -> None:
self._assert_validate_raises(500, UnexpectedValidationError, mock_requests)
# ---------------------------------------------------------------------------
# _list_* pagination tests
# ---------------------------------------------------------------------------
class TestListCourses:
@patch("onyx.connectors.canvas.client.rl_requests")
def test_single_page(self, mock_requests: MagicMock) -> None:
mock_requests.get.return_value = _mock_response(
json_data=[_mock_course(1), _mock_course(2, "CS201", "Data Structures")]
)
connector = _build_connector()
result = connector._list_courses()
assert len(result) == 2
assert result[0].id == 1
assert result[1].id == 2
@patch("onyx.connectors.canvas.client.rl_requests")
def test_empty_response(self, mock_requests: MagicMock) -> None:
mock_requests.get.return_value = _mock_response(json_data=[])
connector = _build_connector()
result = connector._list_courses()
assert result == []
class TestListPages:
@patch("onyx.connectors.canvas.client.rl_requests")
def test_single_page(self, mock_requests: MagicMock) -> None:
mock_requests.get.return_value = _mock_response(
json_data=[_mock_page(10), _mock_page(11, "Notes")]
)
connector = _build_connector()
result = connector._list_pages(course_id=1)
assert len(result) == 2
assert result[0].page_id == 10
assert result[1].page_id == 11
@patch("onyx.connectors.canvas.client.rl_requests")
def test_empty_response(self, mock_requests: MagicMock) -> None:
mock_requests.get.return_value = _mock_response(json_data=[])
connector = _build_connector()
result = connector._list_pages(course_id=1)
assert result == []
class TestListAssignments:
@patch("onyx.connectors.canvas.client.rl_requests")
def test_single_page(self, mock_requests: MagicMock) -> None:
mock_requests.get.return_value = _mock_response(
json_data=[_mock_assignment(20), _mock_assignment(21, "Quiz 1")]
)
connector = _build_connector()
result = connector._list_assignments(course_id=1)
assert len(result) == 2
assert result[0].id == 20
assert result[1].id == 21
@patch("onyx.connectors.canvas.client.rl_requests")
def test_empty_response(self, mock_requests: MagicMock) -> None:
mock_requests.get.return_value = _mock_response(json_data=[])
connector = _build_connector()
result = connector._list_assignments(course_id=1)
assert result == []
class TestListAnnouncements:
@patch("onyx.connectors.canvas.client.rl_requests")
def test_single_page(self, mock_requests: MagicMock) -> None:
mock_requests.get.return_value = _mock_response(
json_data=[_mock_announcement(30), _mock_announcement(31, "Update")]
)
connector = _build_connector()
result = connector._list_announcements(course_id=1)
assert len(result) == 2
assert result[0].id == 30
assert result[1].id == 31
@patch("onyx.connectors.canvas.client.rl_requests")
def test_empty_response(self, mock_requests: MagicMock) -> None:
mock_requests.get.return_value = _mock_response(json_data=[])
connector = _build_connector()
result = connector._list_announcements(course_id=1)
assert result == []

View File

@@ -0,0 +1,45 @@
from unittest.mock import AsyncMock
from unittest.mock import patch
import pytest
from discord.errors import LoginFailure
from onyx.connectors.discord.connector import DiscordConnector
from onyx.connectors.exceptions import CredentialInvalidError
def _build_connector(token: str = "fake-bot-token") -> DiscordConnector:
connector = DiscordConnector()
connector.load_credentials({"discord_bot_token": token})
return connector
@patch("onyx.connectors.discord.connector.Client.close", new_callable=AsyncMock)
@patch("onyx.connectors.discord.connector.Client.login", new_callable=AsyncMock)
def test_validate_success(
mock_login: AsyncMock,
mock_close: AsyncMock,
) -> None:
connector = _build_connector()
connector.validate_connector_settings()
mock_login.assert_awaited_once_with("fake-bot-token")
mock_close.assert_awaited_once()
@patch("onyx.connectors.discord.connector.Client.close", new_callable=AsyncMock)
@patch(
"onyx.connectors.discord.connector.Client.login",
new_callable=AsyncMock,
side_effect=LoginFailure("Improper token has been passed."),
)
def test_validate_invalid_token(
mock_login: AsyncMock, # noqa: ARG001
mock_close: AsyncMock,
) -> None:
connector = _build_connector(token="bad-token")
with pytest.raises(CredentialInvalidError, match="Invalid Discord bot token"):
connector.validate_connector_settings()
mock_close.assert_awaited_once()
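
Both tests assert that Client.close is awaited whether login succeeds or fails, which pins the validator to a try/finally shape. A plausible sketch with the client injected so it stays runnable without discord.py (the exception types are stand-ins; the real connector maps LoginFailure to CredentialInvalidError):

import asyncio

def validate_discord_sketch(make_client, token: str) -> None:
    async def _check() -> None:
        client = make_client()
        try:
            await client.login(token)  # discord.py raises LoginFailure on a bad token
        except Exception as exc:
            raise RuntimeError("Invalid Discord bot token") from exc
        finally:
            await client.close()  # awaited on success and failure alike
    asyncio.run(_check())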

View File

@@ -0,0 +1,225 @@
"""Tests for get_chat_sessions_by_user filtering behavior.
Verifies that failed chat sessions (those with only SYSTEM messages) are
correctly filtered out while preserving recently created sessions, matching
the behavior specified in PR #7233.
"""
from datetime import datetime
from datetime import timedelta
from datetime import timezone
from unittest.mock import MagicMock
from uuid import UUID
from uuid import uuid4
import pytest
from sqlalchemy.orm import Session
from onyx.db.chat import get_chat_sessions_by_user
from onyx.db.models import ChatSession
def _make_session(
user_id: UUID,
time_created: datetime | None = None,
time_updated: datetime | None = None,
description: str = "",
) -> MagicMock:
"""Create a mock ChatSession with the given attributes."""
session = MagicMock(spec=ChatSession)
session.id = uuid4()
session.user_id = user_id
session.time_created = time_created or datetime.now(timezone.utc)
session.time_updated = time_updated or session.time_created
session.description = description
session.deleted = False
session.onyxbot_flow = False
session.project_id = None
return session
@pytest.fixture
def user_id() -> UUID:
return uuid4()
@pytest.fixture
def old_time() -> datetime:
"""A timestamp well outside the 5-minute leeway window."""
return datetime.now(timezone.utc) - timedelta(hours=1)
@pytest.fixture
def recent_time() -> datetime:
"""A timestamp within the 5-minute leeway window."""
return datetime.now(timezone.utc) - timedelta(minutes=2)
class TestGetChatSessionsByUser:
"""Tests for the failed chat filtering logic in get_chat_sessions_by_user."""
def test_filters_out_failed_sessions(
self, user_id: UUID, old_time: datetime
) -> None:
"""Sessions with only SYSTEM messages should be excluded."""
valid_session = _make_session(user_id, time_created=old_time)
failed_session = _make_session(user_id, time_created=old_time)
db_session = MagicMock(spec=Session)
# First execute: returns all sessions
# Second execute: returns only the valid session's ID (has non-system msgs)
mock_result_1 = MagicMock()
mock_result_1.scalars.return_value.all.return_value = [
valid_session,
failed_session,
]
mock_result_2 = MagicMock()
mock_result_2.scalars.return_value.all.return_value = [valid_session.id]
db_session.execute.side_effect = [mock_result_1, mock_result_2]
result = get_chat_sessions_by_user(
user_id=user_id,
deleted=False,
db_session=db_session,
include_failed_chats=False,
)
assert len(result) == 1
assert result[0].id == valid_session.id
def test_keeps_recent_sessions_without_messages(
self, user_id: UUID, recent_time: datetime
) -> None:
"""Recently created sessions should be kept even without messages."""
recent_session = _make_session(user_id, time_created=recent_time)
db_session = MagicMock(spec=Session)
mock_result_1 = MagicMock()
mock_result_1.scalars.return_value.all.return_value = [recent_session]
db_session.execute.side_effect = [mock_result_1]
result = get_chat_sessions_by_user(
user_id=user_id,
deleted=False,
db_session=db_session,
include_failed_chats=False,
)
assert len(result) == 1
assert result[0].id == recent_session.id
# Should only have been called once — no second query needed
# because the recent session is within the leeway window
assert db_session.execute.call_count == 1
def test_include_failed_chats_skips_filtering(
self, user_id: UUID, old_time: datetime
) -> None:
"""When include_failed_chats=True, no filtering should occur."""
session_a = _make_session(user_id, time_created=old_time)
session_b = _make_session(user_id, time_created=old_time)
db_session = MagicMock(spec=Session)
mock_result = MagicMock()
mock_result.scalars.return_value.all.return_value = [session_a, session_b]
db_session.execute.side_effect = [mock_result]
result = get_chat_sessions_by_user(
user_id=user_id,
deleted=False,
db_session=db_session,
include_failed_chats=True,
)
assert len(result) == 2
# Only one DB call — no second query for message validation
assert db_session.execute.call_count == 1
def test_limit_applied_after_filtering(
self, user_id: UUID, old_time: datetime
) -> None:
"""Limit should be applied after filtering, not before."""
sessions = [_make_session(user_id, time_created=old_time) for _ in range(5)]
valid_ids = [s.id for s in sessions[:3]]
db_session = MagicMock(spec=Session)
mock_result_1 = MagicMock()
mock_result_1.scalars.return_value.all.return_value = sessions
mock_result_2 = MagicMock()
mock_result_2.scalars.return_value.all.return_value = valid_ids
db_session.execute.side_effect = [mock_result_1, mock_result_2]
result = get_chat_sessions_by_user(
user_id=user_id,
deleted=False,
db_session=db_session,
include_failed_chats=False,
limit=2,
)
assert len(result) == 2
# Should be the first 2 valid sessions (order preserved)
assert result[0].id == sessions[0].id
assert result[1].id == sessions[1].id
def test_mixed_recent_and_old_sessions(
self, user_id: UUID, old_time: datetime, recent_time: datetime
) -> None:
"""Mix of recent and old sessions should filter correctly."""
old_valid = _make_session(user_id, time_created=old_time)
old_failed = _make_session(user_id, time_created=old_time)
recent_no_msgs = _make_session(user_id, time_created=recent_time)
db_session = MagicMock(spec=Session)
mock_result_1 = MagicMock()
mock_result_1.scalars.return_value.all.return_value = [
old_valid,
old_failed,
recent_no_msgs,
]
mock_result_2 = MagicMock()
mock_result_2.scalars.return_value.all.return_value = [old_valid.id]
db_session.execute.side_effect = [mock_result_1, mock_result_2]
result = get_chat_sessions_by_user(
user_id=user_id,
deleted=False,
db_session=db_session,
include_failed_chats=False,
)
result_ids = {cs.id for cs in result}
assert old_valid.id in result_ids
assert recent_no_msgs.id in result_ids
assert old_failed.id not in result_ids
def test_empty_result(self, user_id: UUID) -> None:
"""No sessions should return empty list without errors."""
db_session = MagicMock(spec=Session)
mock_result = MagicMock()
mock_result.scalars.return_value.all.return_value = []
db_session.execute.side_effect = [mock_result]
result = get_chat_sessions_by_user(
user_id=user_id,
deleted=False,
db_session=db_session,
include_failed_chats=False,
)
assert result == []
assert db_session.execute.call_count == 1
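
The suite implies a two-phase shape for get_chat_sessions_by_user: fetch the sessions; skip all filtering when include_failed_chats is set; keep sessions inside the 5-minute leeway unconditionally; run a second query for the IDs of older sessions that have non-SYSTEM messages (skipped when every session is recent, per test_keeps_recent_sessions_without_messages); and apply the limit only after filtering. A sketch of the post-query filtering with assumed helper names:

from datetime import datetime, timedelta, timezone

LEEWAY = timedelta(minutes=5)  # matches the old_time/recent_time fixtures

def filter_failed_sketch(sessions, ids_with_real_messages, include_failed, limit=None):
    if include_failed:
        return sessions[:limit] if limit else sessions  # no second query, no filtering
    cutoff = datetime.now(timezone.utc) - LEEWAY
    kept = [
        s for s in sessions
        if s.time_created >= cutoff  # recent: kept even with no messages yet
        or s.id in ids_with_real_messages  # old: must have a non-SYSTEM message
    ]
    return kept[:limit] if limit else kept  # limit applies after filtering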

View File

@@ -40,6 +40,8 @@ def test_send_task_includes_expires(
user_files=user_files,
rejected_files=[],
id_to_temp_id={},
skip_indexing_filenames=set(),
indexable_files=user_files,
)
mock_user = MagicMock()

View File

@@ -0,0 +1,198 @@
import io
from typing import cast
import openpyxl
from openpyxl.worksheet.worksheet import Worksheet
from onyx.file_processing.extract_file_text import xlsx_to_text
def _make_xlsx(sheets: dict[str, list[list[str]]]) -> io.BytesIO:
"""Create an in-memory xlsx file from a dict of sheet_name -> matrix of strings."""
wb = openpyxl.Workbook()
if wb.active is not None:
wb.remove(cast(Worksheet, wb.active))
for sheet_name, rows in sheets.items():
ws = wb.create_sheet(title=sheet_name)
for row in rows:
ws.append(row)
buf = io.BytesIO()
wb.save(buf)
buf.seek(0)
return buf
class TestXlsxToText:
def test_single_sheet_basic(self) -> None:
xlsx = _make_xlsx(
{
"Sheet1": [
["Name", "Age"],
["Alice", "30"],
["Bob", "25"],
]
}
)
result = xlsx_to_text(xlsx)
lines = [line for line in result.strip().split("\n") if line.strip()]
assert len(lines) == 3
assert "Name" in lines[0]
assert "Age" in lines[0]
assert "Alice" in lines[1]
assert "30" in lines[1]
assert "Bob" in lines[2]
def test_multiple_sheets_separated(self) -> None:
xlsx = _make_xlsx(
{
"Sheet1": [["a", "b"]],
"Sheet2": [["c", "d"]],
}
)
result = xlsx_to_text(xlsx)
# TEXT_SECTION_SEPARATOR is "\n\n"
assert "\n\n" in result
parts = result.split("\n\n")
assert any("a" in p for p in parts)
assert any("c" in p for p in parts)
def test_empty_cells(self) -> None:
xlsx = _make_xlsx(
{
"Sheet1": [
["a", "", "b"],
["", "c", ""],
]
}
)
result = xlsx_to_text(xlsx)
lines = [line for line in result.strip().split("\n") if line.strip()]
assert len(lines) == 2
def test_commas_in_cells_are_quoted(self) -> None:
"""Cells containing commas should be quoted in CSV output."""
xlsx = _make_xlsx(
{
"Sheet1": [
["hello, world", "normal"],
]
}
)
result = xlsx_to_text(xlsx)
assert '"hello, world"' in result
def test_empty_workbook(self) -> None:
xlsx = _make_xlsx({"Sheet1": []})
result = xlsx_to_text(xlsx)
assert result.strip() == ""
def test_long_empty_row_run_capped(self) -> None:
"""Runs of >2 empty rows should be capped to 2."""
xlsx = _make_xlsx(
{
"Sheet1": [
["header"],
[""],
[""],
[""],
[""],
["data"],
]
}
)
result = xlsx_to_text(xlsx)
lines = [line for line in result.strip().split("\n") if line.strip()]
# 4 empty rows capped to 2, so: header + 2 empty + data = 4 lines
assert len(lines) == 4
assert "header" in lines[0]
assert "data" in lines[-1]
def test_long_empty_col_run_capped(self) -> None:
"""Runs of >2 empty columns should be capped to 2."""
xlsx = _make_xlsx(
{
"Sheet1": [
["a", "", "", "", "b"],
["c", "", "", "", "d"],
]
}
)
result = xlsx_to_text(xlsx)
lines = [line for line in result.strip().split("\n") if line.strip()]
assert len(lines) == 2
# Each row should have 4 fields (a + 2 empty + b), not 5
# csv format: a,,,b (3 commas = 4 fields)
first_line = lines[0].strip()
# Count commas to verify column reduction
assert first_line.count(",") == 3
def test_short_empty_runs_kept(self) -> None:
"""Runs of <=2 empty rows/cols should be preserved."""
xlsx = _make_xlsx(
{
"Sheet1": [
["a", "b"],
["", ""],
["", ""],
["c", "d"],
]
}
)
result = xlsx_to_text(xlsx)
lines = [line for line in result.strip().split("\n") if line.strip()]
# All 4 rows preserved (2 empty rows <= threshold)
assert len(lines) == 4
def test_bad_zip_file_returns_empty(self) -> None:
bad_file = io.BytesIO(b"not a zip file")
result = xlsx_to_text(bad_file, file_name="test.xlsx")
assert result == ""
def test_bad_zip_tilde_file_returns_empty(self) -> None:
bad_file = io.BytesIO(b"not a zip file")
result = xlsx_to_text(bad_file, file_name="~$temp.xlsx")
assert result == ""
def test_large_sparse_sheet(self) -> None:
"""A sheet with data, a big empty gap, and more data — gap is capped to 2."""
rows: list[list[str]] = [["row1_data"]]
rows.extend([[""] for _ in range(10)])
rows.append(["row2_data"])
xlsx = _make_xlsx({"Sheet1": rows})
result = xlsx_to_text(xlsx)
lines = [line for line in result.strip().split("\n") if line.strip()]
# 10 empty rows capped to 2: row1_data + 2 empty + row2_data = 4
assert len(lines) == 4
assert "row1_data" in lines[0]
assert "row2_data" in lines[-1]
def test_quotes_in_cells(self) -> None:
"""Cells containing quotes should be properly escaped."""
xlsx = _make_xlsx(
{
"Sheet1": [
['say "hello"', "normal"],
]
}
)
result = xlsx_to_text(xlsx)
# csv.writer escapes quotes by doubling them
assert '""hello""' in result
def test_each_row_is_separate_line(self) -> None:
"""Each row should produce its own line (regression for writerow vs writerows)."""
xlsx = _make_xlsx(
{
"Sheet1": [
["r1c1", "r1c2"],
["r2c1", "r2c2"],
["r3c1", "r3c2"],
]
}
)
result = xlsx_to_text(xlsx)
lines = [line for line in result.strip().split("\n") if line.strip()]
assert len(lines) == 3
assert "r1c1" in lines[0] and "r1c2" in lines[0]
assert "r2c1" in lines[1] and "r2c2" in lines[1]
assert "r3c1" in lines[2] and "r3c2" in lines[2]

View File

@@ -11,30 +11,13 @@ from onyx.hooks.api_dependencies import require_hook_enabled
 class TestRequireHookEnabled:
     def test_raises_when_multi_tenant(self) -> None:
-        with (
-            patch("onyx.hooks.api_dependencies.MULTI_TENANT", True),
-            patch("onyx.hooks.api_dependencies.HOOK_ENABLED", True),
-        ):
+        with patch("onyx.hooks.api_dependencies.MULTI_TENANT", True):
             with pytest.raises(OnyxError) as exc_info:
                 require_hook_enabled()
             assert exc_info.value.error_code is OnyxErrorCode.SINGLE_TENANT_ONLY
             assert exc_info.value.status_code == 403
             assert "multi-tenant" in exc_info.value.detail
 
-    def test_raises_when_flag_disabled(self) -> None:
-        with (
-            patch("onyx.hooks.api_dependencies.MULTI_TENANT", False),
-            patch("onyx.hooks.api_dependencies.HOOK_ENABLED", False),
-        ):
-            with pytest.raises(OnyxError) as exc_info:
-                require_hook_enabled()
-            assert exc_info.value.error_code is OnyxErrorCode.ENV_VAR_GATED
-            assert exc_info.value.status_code == 403
-            assert "HOOK_ENABLED" in exc_info.value.detail
-
-    def test_passes_when_enabled_single_tenant(self) -> None:
-        with (
-            patch("onyx.hooks.api_dependencies.MULTI_TENANT", False),
-            patch("onyx.hooks.api_dependencies.HOOK_ENABLED", True),
-        ):
+    def test_passes_when_single_tenant(self) -> None:
+        with patch("onyx.hooks.api_dependencies.MULTI_TENANT", False):
             require_hook_enabled()  # must not raise
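
For context, the updated tests describe a dependency with a single gate. The snippet below is a hedged, self-contained sketch of that shape, not the actual onyx.hooks.api_dependencies code (which this diff does not show); the stub types, the OnyxError constructor arguments, and the module-level MULTI_TENANT flag are assumptions inferred from the assertions above.

# Hedged sketch only -- stand-in types, inferred from the tests.
from enum import Enum, auto

class OnyxErrorCode(Enum):  # stand-in for the real error-code enum
    SINGLE_TENANT_ONLY = auto()

class OnyxError(Exception):  # assumed keyword signature
    def __init__(self, error_code: OnyxErrorCode, status_code: int, detail: str):
        super().__init__(detail)
        self.error_code, self.status_code, self.detail = error_code, status_code, detail

MULTI_TENANT = False  # module-level flag; the tests patch this

def require_hook_enabled() -> None:
    if MULTI_TENANT:
        raise OnyxError(
            error_code=OnyxErrorCode.SINGLE_TENANT_ONLY,
            status_code=403,
            detail="Hooks are unavailable in multi-tenant deployments.",
        )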

View File

@@ -2,6 +2,7 @@ import threading
from typing import Any
from typing import cast
from typing import List
from unittest.mock import MagicMock
from unittest.mock import Mock
from unittest.mock import patch
@@ -12,8 +13,13 @@ from onyx.connectors.models import Document
from onyx.connectors.models import DocumentSource
from onyx.connectors.models import ImageSection
from onyx.connectors.models import TextSection
from onyx.hooks.executor import HookSkipped
from onyx.hooks.executor import HookSoftFailed
from onyx.hooks.points.document_ingestion import DocumentIngestionResponse
from onyx.hooks.points.document_ingestion import DocumentIngestionSection
from onyx.indexing.chunker import Chunker
from onyx.indexing.embedder import DefaultIndexingEmbedder
from onyx.indexing.indexing_pipeline import _apply_document_ingestion_hook
from onyx.indexing.indexing_pipeline import add_contextual_summaries
from onyx.indexing.indexing_pipeline import filter_documents
from onyx.indexing.indexing_pipeline import process_image_sections
@@ -223,3 +229,148 @@ def test_contextual_rag(
count += 1
assert chunk.doc_summary == doc_summary
assert chunk.chunk_context == chunk_context
# ---------------------------------------------------------------------------
# _apply_document_ingestion_hook
# ---------------------------------------------------------------------------
_PATCH_EXECUTE_HOOK = "onyx.indexing.indexing_pipeline.execute_hook"
def _make_doc(
doc_id: str = "doc1",
sections: list[TextSection | ImageSection] | None = None,
) -> Document:
if sections is None:
sections = [TextSection(text="Hello", link="http://example.com")]
return Document(
id=doc_id,
title="Test Doc",
semantic_identifier="test-doc",
sections=cast(list[TextSection | ImageSection], sections),
source=DocumentSource.FILE,
metadata={},
)
def test_document_ingestion_hook_skipped_passes_through() -> None:
doc = _make_doc()
with patch(_PATCH_EXECUTE_HOOK, return_value=HookSkipped()):
result = _apply_document_ingestion_hook([doc], MagicMock())
assert result == [doc]
def test_document_ingestion_hook_soft_failed_passes_through() -> None:
doc = _make_doc()
with patch(_PATCH_EXECUTE_HOOK, return_value=HookSoftFailed()):
result = _apply_document_ingestion_hook([doc], MagicMock())
assert result == [doc]
def test_document_ingestion_hook_none_sections_drops_document() -> None:
doc = _make_doc()
with patch(
_PATCH_EXECUTE_HOOK,
return_value=DocumentIngestionResponse(
sections=None, rejection_reason="PII detected"
),
):
result = _apply_document_ingestion_hook([doc], MagicMock())
assert result == []
def test_document_ingestion_hook_all_invalid_sections_drops_document() -> None:
"""A non-empty list where every section has neither text nor image_file_id drops the doc."""
doc = _make_doc()
with patch(
_PATCH_EXECUTE_HOOK,
return_value=DocumentIngestionResponse(sections=[DocumentIngestionSection()]),
):
result = _apply_document_ingestion_hook([doc], MagicMock())
assert result == []
def test_document_ingestion_hook_empty_sections_drops_document() -> None:
doc = _make_doc()
with patch(
_PATCH_EXECUTE_HOOK,
return_value=DocumentIngestionResponse(sections=[]),
):
result = _apply_document_ingestion_hook([doc], MagicMock())
assert result == []
def test_document_ingestion_hook_rewrites_text_sections() -> None:
doc = _make_doc(sections=[TextSection(text="original", link="http://a.com")])
with patch(
_PATCH_EXECUTE_HOOK,
return_value=DocumentIngestionResponse(
sections=[DocumentIngestionSection(text="rewritten", link="http://b.com")]
),
):
result = _apply_document_ingestion_hook([doc], MagicMock())
assert len(result) == 1
assert len(result[0].sections) == 1
section = result[0].sections[0]
assert isinstance(section, TextSection)
assert section.text == "rewritten"
assert section.link == "http://b.com"
def test_document_ingestion_hook_preserves_image_section_order() -> None:
"""Hook receives all sections including images and controls final ordering."""
image = ImageSection(image_file_id="img-1", link=None)
doc = _make_doc(
sections=cast(
list[TextSection | ImageSection],
[TextSection(text="original", link=None), image],
)
)
# Hook moves the image before the text section
with patch(
_PATCH_EXECUTE_HOOK,
return_value=DocumentIngestionResponse(
sections=[
DocumentIngestionSection(image_file_id="img-1", link=None),
DocumentIngestionSection(text="rewritten", link=None),
]
),
):
result = _apply_document_ingestion_hook([doc], MagicMock())
assert len(result) == 1
sections = result[0].sections
assert len(sections) == 2
assert (
isinstance(sections[0], ImageSection) and sections[0].image_file_id == "img-1"
)
assert isinstance(sections[1], TextSection) and sections[1].text == "rewritten"
def test_document_ingestion_hook_mixed_batch() -> None:
"""Drop one doc, rewrite another, pass through a third."""
doc_drop = _make_doc(doc_id="drop")
doc_rewrite = _make_doc(doc_id="rewrite")
doc_skip = _make_doc(doc_id="skip")
def _side_effect(**kwargs: Any) -> Any:
doc_id = kwargs["payload"]["document_id"]
if doc_id == "drop":
return DocumentIngestionResponse(sections=None)
if doc_id == "rewrite":
return DocumentIngestionResponse(
sections=[DocumentIngestionSection(text="new text", link=None)]
)
return HookSkipped()
with patch(_PATCH_EXECUTE_HOOK, side_effect=_side_effect):
result = _apply_document_ingestion_hook(
[doc_drop, doc_rewrite, doc_skip], MagicMock()
)
assert len(result) == 2
ids = {d.id for d in result}
assert ids == {"rewrite", "skip"}
rewritten = next(d for d in result if d.id == "rewrite")
assert isinstance(rewritten.sections[0], TextSection)
assert rewritten.sections[0].text == "new text"
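
Taken together, the tests above fix the hook's contract: HookSkipped and HookSoftFailed pass the document through untouched; a response whose sections are None, empty, or all invalid drops it; otherwise the response rewrites the sections, ordering included. Below is a self-contained toy sketch of that control flow, with stand-in types instead of the real onyx.hooks models; it is inferred from the assertions, not copied from indexing_pipeline.py.

from dataclasses import dataclass, field

@dataclass
class Section:  # stand-in for TextSection/ImageSection/DocumentIngestionSection
    text: str | None = None
    image_file_id: str | None = None

@dataclass
class Doc:  # stand-in for onyx.connectors.models.Document
    id: str
    sections: list[Section] = field(default_factory=list)

class Skipped: ...      # stand-in for HookSkipped
class SoftFailed: ...   # stand-in for HookSoftFailed

@dataclass
class Response:  # stand-in for DocumentIngestionResponse
    sections: list[Section] | None = None

def apply_ingestion_hook(docs: list[Doc], run_hook) -> list[Doc]:
    """Inferred flow of _apply_document_ingestion_hook; not the real code."""
    kept: list[Doc] = []
    for doc in docs:
        outcome = run_hook(doc)
        if isinstance(outcome, (Skipped, SoftFailed)):
            kept.append(doc)  # hook declined or soft-failed: pass through
            continue
        valid = [s for s in (outcome.sections or []) if s.text or s.image_file_id]
        if not valid:
            continue  # sections=None, [], or all-invalid: drop the document
        doc.sections = valid  # hook output also controls final section order
        kept.append(doc)
    return kept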

View File

@@ -417,3 +417,57 @@ def test_categorize_text_under_token_limit_accepted(
assert len(result.acceptable) == 1
assert result.acceptable_file_to_token_count["ok.txt"] == 500
# --- skip-indexing vs rejection by file type ---
def test_csv_over_token_threshold_accepted_skip_indexing(
monkeypatch: pytest.MonkeyPatch,
) -> None:
"""CSV exceeding token threshold is uploaded but flagged to skip indexing."""
_patch_common_dependencies(monkeypatch, upload_size_mb=1000, token_threshold_k=1)
text = "x" * 2000 # 2000 tokens > 1000 threshold
monkeypatch.setattr(utils, "extract_file_text", lambda **_kwargs: text)
upload = _make_upload("large.csv", size=2000, content=text.encode())
result = utils.categorize_uploaded_files([upload], MagicMock())
assert len(result.acceptable) == 1
assert result.acceptable[0].filename == "large.csv"
assert "large.csv" in result.skip_indexing
assert len(result.rejected) == 0
def test_csv_under_token_threshold_accepted_and_indexed(
monkeypatch: pytest.MonkeyPatch,
) -> None:
"""CSV under token threshold is uploaded and indexed normally."""
_patch_common_dependencies(monkeypatch, upload_size_mb=1000, token_threshold_k=1)
text = "x" * 500 # 500 tokens < 1000 threshold
monkeypatch.setattr(utils, "extract_file_text", lambda **_kwargs: text)
upload = _make_upload("small.csv", size=500, content=text.encode())
result = utils.categorize_uploaded_files([upload], MagicMock())
assert len(result.acceptable) == 1
assert result.acceptable[0].filename == "small.csv"
assert "small.csv" not in result.skip_indexing
assert len(result.rejected) == 0
def test_pdf_over_token_threshold_rejected(
monkeypatch: pytest.MonkeyPatch,
) -> None:
"""PDF exceeding token threshold is rejected entirely (not uploaded)."""
_patch_common_dependencies(monkeypatch, upload_size_mb=1000, token_threshold_k=1)
text = "x" * 2000 # 2000 tokens > 1000 threshold
monkeypatch.setattr(utils, "extract_file_text", lambda **_kwargs: text)
upload = _make_upload("big.pdf", size=2000, content=text.encode())
result = utils.categorize_uploaded_files([upload], MagicMock())
assert len(result.rejected) == 1
assert result.rejected[0].filename == "big.pdf"
assert "1K token limit" in result.rejected[0].reason
assert len(result.acceptable) == 0
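
The split these tests document is worth stating plainly: when extracted text exceeds the token threshold, tabular files such as CSV are still uploaded but flagged in skip_indexing, while document files such as PDF are rejected outright. A toy sketch of that decision follows; the suffix set and function name are illustrative assumptions, since the real logic lives in categorize_uploaded_files, which this diff does not show.

# Illustrative only -- suffix set and names are assumptions, not source code.
UPLOAD_WITHOUT_INDEXING_SUFFIXES = {".csv"}

def over_threshold_action(filename: str) -> str:
    suffix = filename[filename.rfind("."):].lower() if "." in filename else ""
    if suffix in UPLOAD_WITHOUT_INDEXING_SUFFIXES:
        return "upload_but_skip_indexing"  # the large.csv path above
    return "reject"                        # the big.pdf path above

assert over_threshold_action("large.csv") == "upload_but_skip_indexing"
assert over_threshold_action("big.pdf") == "reject"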

Some files were not shown because too many files have changed in this diff.