Compare commits

844 Commits

Author SHA1 Message Date
SubashMohan
c54752e3ba fix non-accepted files popup 2025-10-23 18:19:45 +05:30
SubashMohan
bd32795804 fix file item issues 2025-10-23 18:19:45 +05:30
SubashMohan
7aa6b01ac0 Update Truncated component styling for better text handling 2025-10-23 18:19:45 +05:30
SubashMohan
93084a3a39 fix project bugs 2025-10-23 18:19:45 +05:30
Nikolas Garza
e46f632570 fix: allow user knowledge (file uploads) always (#5857)
Co-authored-by: Nikolas Garza <nikolas@Nikolass-MacBook-Pro.attlocal.net>
2025-10-22 23:37:25 +00:00
Justin Tahara
bbb4b9eda3 fix(docker): Clean up USE_IAM_AUTH log (#5870) 2025-10-22 15:59:12 -07:00
Richard Guan
12b7c7d4dd chore(ripsecrets): ripsecrets (#5868) 2025-10-22 22:02:57 +00:00
trial2onyx
464967340b chore(docker): prefer uv for installing python system packages (#5861)
Co-authored-by: Onyx Trialee 2 <onyxtrial2@Onyxs-MBP.attlocal.net>
2025-10-22 21:48:41 +00:00
trial2onyx
a2308c2f45 chore(gha): deduplicate prepare-build and migrate to uv (#5862)
Co-authored-by: Onyx Trialee 2 <onyxtrial2@Onyxs-MBP.attlocal.net>
2025-10-22 21:20:37 +00:00
trial2onyx
2ee9f79f71 chore(docker): remove empty echo ONYX_VERSION layers (#5848)
Co-authored-by: Onyx Trialee 2 <onyxtrial2@Onyxs-MBP.attlocal.net>
2025-10-22 20:36:36 +00:00
trial2onyx
c3904b7c96 fix(release): correctly set ONYX_VERSION in model-server image (#5847)
Co-authored-by: Onyx Trialee 2 <onyxtrial2@Onyxs-MBP.attlocal.net>
2025-10-22 19:56:55 +00:00
trial2onyx
5009dcf911 chore(docker): avoid duplicating cached models layer (#5845)
Co-authored-by: Onyx Trialee 2 <onyxtrial2@Onyxs-MBP.attlocal.net>
2025-10-22 19:56:42 +00:00
trial2onyx
c7b4a0fad9 chore(github): flag and enable docker build caching (#5839)
Co-authored-by: Onyx Trialee 2 <onyxtrial2@Onyxs-MBP.attlocal.net>
2025-10-22 19:56:23 +00:00
Raunak Bhagat
60a402fcab Render chat and project button popovers using the PopoverMenu component (#5858) 2025-10-21 20:37:22 -07:00
Raunak Bhagat
c9bb078a37 Edit height of mask again (#5856) 2025-10-21 20:22:51 -07:00
Raunak Bhagat
c36c2a6c8d fix: Edit height of mask (#5855) 2025-10-21 20:17:39 -07:00
Raunak Bhagat
f9e2f9cbb4 refactor: Remove hover state on chatbutton rename (#5850) 2025-10-21 19:44:57 -07:00
Raunak Bhagat
0b7c808480 refactor: "Unnest" admin panel button (#5852) 2025-10-21 19:30:00 -07:00
Justin Tahara
0a6ff30ee4 fix(ui): Update spacing for the API Key page (#5826) 2025-10-21 18:26:14 -07:00
Raunak Bhagat
dc036eb452 fix: Spacings update (#5846) 2025-10-21 18:11:13 -07:00
Justin Tahara
ee950b9cbd fix(ui): Document Processing revamp (#5825) 2025-10-21 17:56:06 -07:00
Justin Tahara
dd71765849 fix(internal search): Restore functionality (#5843) 2025-10-21 16:54:10 -07:00
Raunak Bhagat
dc6b97f1b1 refactor: Edit message generation ui (#5816) 2025-10-21 16:51:14 -07:00
Richard Guan
d960c23b6a chore(fix): input images in msg (#5798) 2025-10-21 20:33:00 +00:00
Richard Guan
d9c753ba92 chore(simple): agent small adjustments (#5729) 2025-10-21 20:32:57 +00:00
Chris Weaver
60234dd6da feat: Improve litellm model map logic (#5829) 2025-10-21 13:22:35 -07:00
Justin Tahara
f88ef2e9ff fix(ui): Align Default Assistant Page (#5828) 2025-10-21 19:12:30 +00:00
Chris Weaver
6b479a01ea feat: run tasks for gated tenants (#5827) 2025-10-21 11:47:39 -07:00
Wenxi
248fe416e1 chore: update template reference to sso ee (#5830) 2025-10-21 11:39:12 -07:00
trial2onyx
cbea4bb75c chore(docker): avoid chown-ing playwright cache (#5805)
Co-authored-by: Onyx Trialee 2 <onyxtrial2@Onyxs-MBP.attlocal.net>
2025-10-21 17:23:49 +00:00
Justin Tahara
4a147a48dc fix(ui): Update Upload Image and Generate Icon buttons (#5824) 2025-10-21 10:41:57 -07:00
Chris Weaver
a77025cd46 fix: adjust deletion threshold (#5818) 2025-10-21 10:37:10 -07:00
Jessica Singh
d10914ccc6 fix(teams connector): special char bug (#5767) 2025-10-21 10:27:37 -07:00
Nikolas Garza
7d44d48f87 fix: switch out OnyxSparkleIcon for OnyxIcon for default assistant (#5806)
Co-authored-by: Nikolas Garza <nikolas@Nikolass-MacBook-Pro.attlocal.net>
2025-10-20 17:48:30 -07:00
Chris Weaver
82fd0e0316 fix: extra sidebar spacing (#5811) 2025-10-20 17:48:02 -07:00
Wenxi
d7e4c47ef1 fix: custom llm setup fixes (#5804) 2025-10-20 17:47:38 -07:00
Wenxi
799b0df1cb fix: don't set new default provider when deleted provider was not default (#5812)
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-10-20 17:42:47 -07:00
Justin Tahara
b31d36564a fix(ui): Make Document Sets editable (#5809) 2025-10-20 17:41:08 -07:00
Chris Weaver
84df0a1bf9 feat: more cleanup script improvements (#5803) 2025-10-20 17:29:38 -07:00
Justin Tahara
dbc53fe176 fix(ui): Set as Default for LLM (#5795) 2025-10-20 17:21:12 -07:00
Wenxi
1e4ba93daa feat: optimistically rename chat sidebar items (#5810)
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-10-20 17:01:44 -07:00
Wenxi
d872715620 feat: azure parse deployment name (#5807)
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Evan Lohn <evan@danswer.ai>
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-10-20 16:52:56 -07:00
Chris Weaver
46ad541ebc fix: whitelabeling assistant logo (#5808) 2025-10-20 16:52:26 -07:00
Raunak Bhagat
613907a06f fix: Fix colouring for all error pages (#5802) 2025-10-20 15:42:41 -07:00
Wenxi
ff723992d1 feat: openrouter support (#5772) 2025-10-20 14:24:12 -07:00
Raunak Bhagat
bda3c6b189 fix: Edit background colour (#5800) 2025-10-20 14:09:53 -07:00
Evan Lohn
264d1de994 chore: disable contextual rag in the cloud (#5801) 2025-10-20 13:52:57 -07:00
Nikolas Garza
335571ce79 feat: Add React testing framework (#5778) 2025-10-20 13:49:44 -07:00
Chris Weaver
4d3fac2574 feat: enhance tenant cleanup (#5788) 2025-10-20 13:29:04 -07:00
Evan Lohn
7c229dd103 fix: reduce spam of org info toast (#5794) 2025-10-20 13:01:05 -07:00
Evan Lohn
b5df182a36 chore: hide search settings in the cloud (#5796) 2025-10-20 13:00:48 -07:00
Justin Tahara
7e7cfa4187 fix(ui): Initial Index Attempt Tooltip (#5789) 2025-10-20 12:57:34 -07:00
Justin Tahara
69d8430288 fix(ui): Create Button Type (#5797) 2025-10-20 12:48:39 -07:00
Raunak Bhagat
467d294b30 fix: Add white-labelling back (#5757) 2025-10-19 13:25:25 -07:00
Chris Weaver
ba2dd18233 feat: improve performance of deletion scripts (#5787) 2025-10-19 13:02:19 -07:00
Chris Weaver
891eeb0212 feat: add new fields to usage report (#5784) 2025-10-19 12:37:57 -07:00
Wenxi
9085731ff0 fix: add latest check to merge step (#5781) 2025-10-18 18:26:21 -07:00
Chris Weaver
f5d88c47f4 fix: docker-tag-latest.yml (#5780) 2025-10-18 09:06:14 -07:00
Raunak Bhagat
807e5c21b0 fix: Fix styling (#5776) 2025-10-17 18:49:33 -07:00
Raunak Bhagat
1bcd795011 fix: Font loading fix (#5773) 2025-10-17 18:39:30 -07:00
Raunak Bhagat
aae357df40 fix: Fix document sidebar positioning + update stylings (#5769)
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-10-17 18:26:53 -07:00
Justin Tahara
4f03e85c57 fix(llm): Cleaning up models (#5771) 2025-10-17 23:48:08 +00:00
Nikolas Garza
c3411fb28d feat: read latest permission sync from the frontend (#5687)
Co-authored-by: Nikolas Garza <nikolas@Nikolass-MacBook-Pro.attlocal.net>
Co-authored-by: Nikolas Garza <nikolas@Nikolass-MBP.attlocal.net>
2025-10-17 23:43:55 +00:00
Richard Guan
b3d1b1f4aa chore(langfuse): tracing (#5753) 2025-10-17 22:44:22 +00:00
Justin Tahara
cbb86c12aa fix(bedrock): Make Region Selectable (#5770) 2025-10-17 22:26:24 +00:00
Chris Weaver
8fd606b713 fix: documents in chat flow (#5762) 2025-10-17 13:56:38 -07:00
Nikolas Garza
d69170ee13 chore: cleanup some console.logs (#5766)
Co-authored-by: Nikolas Garza <nikolas@Nikolass-MacBook-Pro.attlocal.net>
2025-10-17 13:52:44 -07:00
Justin Tahara
e356c5308c fix(ui): Cleaning up the Edit Action page (#5765) 2025-10-17 13:52:36 -07:00
Wenxi
3026ac8912 feat: blob connector and test enhancements (#5746) 2025-10-17 13:52:03 -07:00
Justin Tahara
0cee7c849f feat(curators): Allow curators to customize Actions (#5752) 2025-10-17 19:07:58 +00:00
Yuhong Sun
14bfb7fd0c No Bash in Background Container (#5761) 2025-10-17 11:15:11 -07:00
Justin Tahara
804e48a3da fix(ui): Fix Available Methods Table (#5756) 2025-10-17 17:29:18 +00:00
SubashMohan
907271656e fix: Fix "Projects" new UI components (#5662) 2025-10-17 17:21:36 +00:00
Chris Weaver
1f11dd3e46 refactor: make OIDC / SAML MIT licensed (#5739) 2025-10-17 16:25:50 +00:00
Raunak Bhagat
048561ce0b fix: Fix colours for error page (#5758) 2025-10-17 09:35:34 -07:00
Raunak Bhagat
8718f10c38 fix: Fix all tooltips rendering raw text (#5755) 2025-10-17 09:34:46 -07:00
Evan Lohn
ab4d820089 feat: user info personalization (#5743) 2025-10-17 00:49:36 +00:00
Justin Tahara
77ae4f1a45 feat(users): Add User Counts (#5750) 2025-10-16 18:17:17 -07:00
Raunak Bhagat
8fd1f42a1c docs: Add a new standards file for the web directory (#5749) 2025-10-16 18:03:25 -07:00
Chris Weaver
b94c7e581b fix: quality checks (#5747)
Co-authored-by: Evan Lohn <evan@danswer.ai>
2025-10-16 16:58:24 -07:00
Wenxi
c90ff701dc chore: move gh non-secrets to vars (#5744) 2025-10-16 16:42:24 -07:00
Justin Tahara
b1ad58c5af fix(ui): Fix Invite Modal (#5748) 2025-10-16 16:30:48 -07:00
Eli Ben-Shoshan
345f9b3497 feat: added support to generate sha256 checksum before uploading file to object store (#5734)
Co-authored-by: Eli Ben-Shoshan <ebs@ufl.edu>
2025-10-16 14:36:12 -07:00
Justin Tahara
4671d18d4f fix(sso): Fix Logout UI (#5741) 2025-10-16 17:24:07 +00:00
Wenxi
f0598be875 fix: s3 connector citation bugs (#5740) 2025-10-16 10:08:13 -07:00
Justin Tahara
eb361c6434 feat(onboarding): Pin Featured Agents to New Users (#5736) 2025-10-15 16:11:56 -07:00
Nikolas Garza
e39b0a921c feat: plumb auto sync permission attempts to celery tasks (#5686)
Co-authored-by: Nikolas Garza <nikolas@Nikolass-MacBook-Pro.attlocal.net>
2025-10-15 21:15:35 +00:00
Jessica Singh
2dd8a8c788 fix(slack bot ui): update tokens dark mode (#5728)
Co-authored-by: Raunak Bhagat <r@rabh.io>
2025-10-15 21:14:41 +00:00
Raunak Bhagat
8b79e2e90b feat: Unpin agent (#5721)
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-10-15 11:20:49 -07:00
Nikolas Garza
d05941d1bd feat: basic db methods to create, update, delete permission sync attempts (#5682)
Co-authored-by: Nikolas Garza <nikolas@Nikolass-MacBook-Pro.attlocal.net>
2025-10-15 17:28:18 +00:00
Raunak Bhagat
50070fb264 refactor: Make colours in AppSidebar darker (#5725) 2025-10-15 10:15:36 -07:00
edwin-onyx
5792d8d5ed fix(infra): consolidate more celery workers into background worker for default lightweight mode (#5718) 2025-10-15 15:48:18 +00:00
Raunak Bhagat
e1c4b33cf7 fix: Edit AccessRestrictedPage component to render the proper colours (#5724) 2025-10-15 03:10:47 +00:00
Richard Guan
2c2f6e7c23 feat(framework): simple agent to feature flags (#5692)
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-10-14 20:27:07 -07:00
Raunak Bhagat
3d30233d46 fix: Edit rendering issues with attached files (CSVs, images, text files, all other files) (#5708)
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-10-14 18:26:35 -07:00
Justin Tahara
875f8cff5c feat(google drive): Add small log (#5723) 2025-10-14 17:21:22 -07:00
Raunak Bhagat
6e4686a09f fix: Update colour of checkbox (#5707) 2025-10-14 22:47:31 +00:00
Justin Tahara
237c18e15e fix(ui): Fix Assistant Image not showing up in Sidebar (#5722) 2025-10-14 21:47:42 +00:00
Justin Tahara
a71d80329d fix(admin): Properly show Unique User count (#5717) 2025-10-14 21:44:09 +00:00
Justin Tahara
91c392b4fc fix(logout): Fix logout again for new UI (#5719) 2025-10-14 21:28:49 +00:00
Justin Tahara
a25df4002d fix(document sets): Delete Federated Slack Document Sets (#5716) 2025-10-14 18:45:57 +00:00
Nikolas Garza
436a5add88 feat: tables/migration for permission syncing attempts (#5681)
Co-authored-by: Nikolas Garza <nikolas@Nikolass-MacBook-Pro.attlocal.net>
2025-10-14 18:43:02 +00:00
edwin-onyx
3a4bb239b1 fix(infra): consolidate heavy, monitoring, and user file worker into one (#5558)
Co-authored-by: Edwin Luo <edwin@parafin.com>
2025-10-14 01:19:47 +00:00
Raunak Bhagat
2acb4cfdb6 fix: Fix selector colour (#5705) 2025-10-14 00:41:27 +00:00
Justin Tahara
f1d626adb0 fix(ui): Updated Document Sets UI (#5706) 2025-10-14 00:27:21 +00:00
Justin Tahara
5ca604f186 fix(slack): Fix Fed Slack Gear Button Error (#5704) 2025-10-13 23:16:26 +00:00
Chris Weaver
c19c76c3ad feat: tenant cleanup (#5703) 2025-10-13 14:19:02 -07:00
Justin Tahara
4555f6badc fix(entra): JWT Passthrough for Entra (#5697) 2025-10-13 19:57:11 +00:00
Justin Tahara
71bd643537 fix(helm): File Processing fix for helm (#5696) 2025-10-12 15:14:31 -07:00
SubashMohan
23f70f0a96 fix(indexing page): Improve page loading time (#5695) 2025-10-11 10:16:28 -07:00
Evan Lohn
c97672559a feat: org info (#5694) 2025-10-11 04:10:52 +00:00
Evan Lohn
243f0bbdbd fix: assorted mcp improvements (#5684) 2025-10-11 03:51:00 +00:00
Chris Weaver
0a5ca7f1cf feat: gemini-embedding-001 + search settings fixes (#5691) 2025-10-10 18:04:17 -07:00
Justin Tahara
8d56d213ec fix(ui): Update UI change (#5688) 2025-10-10 23:24:20 +00:00
Richard Guan
cea2ea924b feat(Simple Agent): [part1 - backwards compatible changes] (#5569) 2025-10-10 22:06:54 +00:00
Richard Guan
569d205e31 feat(flags): posthog feature flags (#5690) 2025-10-10 21:54:07 +00:00
Chris Weaver
9feff5002f fix: chat tweaks (#5685) 2025-10-09 22:04:56 -07:00
Nikolas Garza
a1314e49a3 fix: use system node version for prettier pre-commit hook (#5679)
Co-authored-by: Nikolas Garza <nikolas@Nikolass-MacBook-Pro.attlocal.net>
2025-10-09 17:21:19 -07:00
Nikolas Garza
463f839154 fix: show canceled status when indexing is canceled (#5675)
Co-authored-by: Nikolas Garza <nikolas@Nikolass-MacBook-Pro.attlocal.net>
2025-10-09 17:02:53 -07:00
Nikolas Garza
5a0fe3c1d1 fix: surface user friendly model names for bedrock models (#5680)
Co-authored-by: Nikolas Garza <nikolas@Nikolass-MacBook-Pro.attlocal.net>
2025-10-09 17:00:11 -07:00
Raunak Bhagat
8ac5c86c1e fix: Fix failing tests (#5628) 2025-10-09 22:34:49 +00:00
Chris Weaver
d803b48edd feat: Increase nginx default timeout (#5677) 2025-10-09 22:13:18 +00:00
Wenxi
bc3adcdc89 fix: non-image gen models and add test (#5678) 2025-10-09 14:38:15 -07:00
Raunak Bhagat
95e27f1c30 refactor: Clean up some context files (#5672) 2025-10-09 19:27:29 +00:00
Justin Tahara
d0724312db fix(helm): Removing duplicate exec (#5676) 2025-10-09 12:16:14 -07:00
Justin Tahara
5b1021f20b fix(eml): Fixing EML to Text Function (#5674) 2025-10-09 10:59:26 -07:00
Chris Weaver
55cdbe396f fix: action toggle (#5670) 2025-10-08 19:03:06 -07:00
Raunak Bhagat
e8fe0fecd2 feat: Update formik colours (action + danger) to the new colour palette (#5668) 2025-10-08 14:42:26 -07:00
SubashMohan
5b4fc91a3e Fix/doc id migration task (#5620) 2025-10-08 14:40:12 -07:00
Raunak Bhagat
afd2d8c362 fix: Comment out keystroke hijacking (#5659) 2025-10-08 11:59:27 -07:00
Evan Lohn
8a8cf13089 feat: attachments are separate docs (#5641) 2025-10-08 10:29:06 -07:00
Evan Lohn
c7e872d4e3 feat: selectively run sf and hubspot tests (#5657) 2025-10-08 10:25:00 -07:00
Wenxi
1dbe926518 chore: bump dependabot backlog (#5653) 2025-10-08 09:57:27 -07:00
Chris Weaver
d095bec6df fix: deep research hiding logic (#5660) 2025-10-08 09:05:35 -07:00
Raunak Bhagat
58e8d501a1 fix: Add settings sections back (#5661) 2025-10-08 02:54:07 +00:00
Chris Weaver
a39782468b refactor: improve modal behavior (#5649) 2025-10-07 19:10:17 -07:00
Justin Tahara
d747b48d22 fix(Blob Storage): Add Chunking + Size Limits (#5638) 2025-10-07 19:02:01 -07:00
Yuhong Sun
817de23854 Remove document seeding (#5656) 2025-10-07 18:41:39 -07:00
Richard Guan
6474d30ba0 fix(braintrust): dont decorate generate and clean up unused code (#5636) 2025-10-07 15:11:47 -07:00
Paulius Klyvis
6c9635373a fix: ensure now and dt are in utc in gpt search (#5605) 2025-10-07 14:50:55 -07:00
Wenxi
1a945b6f94 chore: update comm links (#5650) 2025-10-07 14:40:54 -07:00
Chris Weaver
526c76fa08 fix: assistant creation (#5648) 2025-10-07 14:20:44 -07:00
Justin Tahara
932e62531f fix(UI): Update User Settings Model Selection (#5630) 2025-10-07 14:11:24 -07:00
edwin-onyx
83768e2ff1 fix(infra): lazy load nltk and some more (#5634) 2025-10-07 13:53:50 -07:00
Chris Weaver
f23b6506f4 fix: sidebar state persistence (#5647) 2025-10-07 13:25:16 -07:00
Chris Weaver
5f09318302 fix: pin text + create click (#5646) 2025-10-07 12:52:51 -07:00
Justin Tahara
674e789036 fix(playwright): Update Email Password Form (#5644) 2025-10-07 12:18:08 -07:00
Chris Weaver
cb514e6e34 fix: align text with icon (#5645) 2025-10-07 12:08:04 -07:00
Justin Tahara
965dad785c fix(images): Update Image Gen workflow after Refactor (#5631) 2025-10-07 12:03:48 -07:00
Chris Weaver
c9558224d2 feat: improved markdown spacing (#5643) 2025-10-07 12:02:24 -07:00
Chris Weaver
c2dbd3fd1e fix: slack bot creation (#5637) 2025-10-07 11:40:25 -07:00
Wenxi
d27c2b1b4e chore: update contributing readmes (#5635) 2025-10-07 10:24:58 -07:00
edwin-onyx
8c52444bda fix(infra): lazy load and don't warm up model server models (#5527)
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-10-07 09:07:40 -07:00
Justin Tahara
b4caa85cd4 fix(infra): Nginx updates (#5627) 2025-10-06 13:37:23 -07:00
Wenxi
57163dd936 chore: pin prettier pre-commit version and run on web for comm prs (#5624) 2025-10-06 12:08:46 -07:00
Chris Weaver
15f2a0bf60 fix: regression (#5625) 2025-10-06 12:02:31 -07:00
Justin Tahara
aeae7ebdef fix(SAML): Add additional Email Fields (#5557) 2025-10-06 12:00:43 -07:00
Raunak Bhagat
eaa14a5ce0 feat: UI Refresh (#5529)
Co-authored-by: SubashMohan <subashmohan75@gmail.com>
2025-10-05 23:04:14 -07:00
Shahar Mazor
b07c834e83 Add RTL support (#5609) 2025-10-05 12:24:38 -07:00
Chris Weaver
97cd308ef7 fix: remove duplicate "file" option in Add Connector page (#5612) 2025-10-05 11:59:39 -07:00
Nils
28cdab7a70 feat: support SharePoint Teams URLs (#5498)
Co-authored-by: nsklei <nils.kleinrahm@pledoc.de>
2025-10-05 11:34:35 -07:00
Chris Weaver
ad9aa01819 ci: adjust latest/edge tags (#5610) 2025-10-04 14:30:40 -07:00
Justin Tahara
508a88c8d7 fix(helm): Migrate from Bitnami NGINX (#5599) 2025-10-03 17:45:59 -07:00
Justin Tahara
b6f81fbb8e fix(SSO): Logout functionality fixed (#5600) 2025-10-03 16:58:51 -07:00
Chris Weaver
b9b66396ec fix: try reduce playwright flake (#5598) 2025-10-03 13:10:32 -07:00
Justin Tahara
dd20b9ef4c fix(helm): MinIO Migration from Bitnami (#5597) 2025-10-03 12:32:27 -07:00
Evan Lohn
e1f7e8cacf feat: better interface for slim connectors (#5592) 2025-10-03 10:51:14 -07:00
Justin Tahara
fd567279fd fix(helm): Chart dependency for DB chart (#5596) 2025-10-03 10:43:27 -07:00
Justin Tahara
1427eb3cf0 fix(helm): Remove Bitnami Dependency for DB Charts (#5593) 2025-10-03 10:38:13 -07:00
trial-danswer
e70be0f816 feat: add serper web search provider (#5545) 2025-10-03 10:35:33 -07:00
Justin Tahara
0014c7cff7 Revert "fix(github): Revert cache being turned off" (#5594) 2025-10-03 09:48:44 -07:00
Richard Guan
1c23dbeaee fix(mcp): asyncio simple sync run (#5591) 2025-10-03 01:29:57 +00:00
Chris Weaver
b2b122a24b fix: jira perm sync (#5585) 2025-10-02 16:32:22 -07:00
Wenxi
033ae74b0e fix: allow web connector to recurse www even if not specified (#5584) 2025-10-02 16:14:54 -07:00
Evan Lohn
c593fb4866 fix(github): Revert cache being turned off (#5589) 2025-10-02 15:36:01 -07:00
trial-danswer
b9580ef346 feat: Add download users (#5563) 2025-10-02 15:30:27 -07:00
Wenxi
4df3a9204f fix: reindex logic and allow seeded docs to refresh (#5578) 2025-10-02 15:22:24 -07:00
Wenxi
e0ad313a60 chore: bump playwright version (#5581) 2025-10-02 15:16:54 -07:00
Evan Lohn
a2bfb46edd fix: deny invalid space keys (#5570) 2025-10-02 15:01:17 -07:00
Evan Lohn
25e3371bee fix: minor mcp fixes + test (#5564) 2025-10-02 13:05:56 -07:00
Chris Weaver
4b9b306140 feat: enable DR by default (#5576) 2025-10-02 12:53:16 -07:00
Wenxi
ccf55136be feat: ollama official support (#5509) 2025-10-02 10:47:16 -07:00
Evan Lohn
a13db828f3 Revert "fix(github): Revert cache being turned off" (#5575) 2025-10-02 10:22:43 -07:00
SubashMohan
b7d56d0645 increase docid migration task priority (#5571) 2025-10-02 21:42:55 +05:30
SubashMohan
9ac70d35a8 Fix/indexattempt deletion failure (#5573) 2025-10-02 08:42:16 -07:00
Evan Lohn
7da792dd27 fix: user info 404s (#5567) 2025-10-01 23:20:31 +00:00
Chris Weaver
136c2f4082 Skip flakey test (#5566) 2025-10-01 15:14:18 -07:00
edwin-onyx
67bd14e801 fix(infra): lazy import litellm and some more pkgs and add layer to connector instantiation for lazy loading (#5488)
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-10-01 15:05:27 -07:00
Evan Lohn
8c9a20be7a fix: atlassian scoped tokens (#5483) 2025-10-01 14:53:28 -07:00
Wenxi
0427845502 fix: weird text wrapping (#5565) 2025-10-01 14:41:49 -07:00
Chris Weaver
a85a5a324e More prod fixes (#5556) 2025-09-30 14:43:21 -07:00
SubashMohan
78f1fb5bf4 fix(projects): Fix Migration (#5550) 2025-09-30 12:56:40 -07:00
Evan Lohn
6a8a214324 fix: avoid attempting to retrieve with non-org owners (#5555) 2025-09-30 12:55:03 -07:00
Justin Tahara
884266c009 fix(saml): Update the route to take GET's and transform to POST (#5554) 2025-09-30 11:28:07 -07:00
Chris Weaver
2c422215e6 Fix prod compose (#5553) 2025-09-30 10:41:38 -07:00
joachim-danswer
32fe185bb4 fix: set gpt-5 thinking setting (#5539) 2025-09-30 09:57:36 -07:00
Chris Weaver
c2758a28d5 fix: project migration tweak (#5544) 2025-09-29 19:57:09 -07:00
Justin Tahara
5cda2e0173 feat(LLM): Add Claude Sonnet 4.5 (#5543) 2025-09-29 17:58:18 -07:00
Evan Lohn
9e885a68b3 feat: mcp client v2 (#5481) 2025-09-29 17:01:32 -07:00
Justin Tahara
376fc86b0c fix(saml): GET Method for SAML Callback (#5538) 2025-09-29 15:08:44 -07:00
Chris Weaver
2eb1444d80 fix: more test hardening (#5537) 2025-09-29 13:54:56 -07:00
SubashMohan
bd6ebe4718 feat(chat): add popup handling for image file selection in ChatInputBar (#5536) 2025-09-29 11:02:18 -07:00
Chris Weaver
691d63bc0f fix: remove console.log (#5533) 2025-09-29 10:54:54 -07:00
Chris Weaver
dfd4d9abef fix: playwright tests (#5522) 2025-09-29 09:04:10 -07:00
SubashMohan
4cb39bc150 fix chat issue and change view icon (#5525) 2025-09-29 12:28:07 +05:30
Chris Weaver
4e357478e0 fix: package-lock.json (#5530) 2025-09-28 13:43:42 -07:00
Wenxi
b5b1b3287c fix: update package lock after projects merge (#5514) 2025-09-28 13:00:32 -07:00
Wenxi
2f58a972eb fix: launch template post projects merge (#5528) 2025-09-28 12:57:54 -07:00
Yuhong Sun
6b39d8eed9 Docker Version Check (#5523) 2025-09-27 19:03:43 -07:00
Chris Weaver
f81c34d040 fix: editing/regeneration (#5521) 2025-09-27 17:43:03 -07:00
Yuhong Sun
0771b1f476 SQL plaintext file (#5520) 2025-09-27 15:36:44 -07:00
Jessica Singh
eedd2ba3fe fix(source selection): enable all by default and persist choice (#5511) 2025-09-26 17:15:40 -07:00
Chris Weaver
98554e5025 feat: small projects UX tweaks (#5513) 2025-09-26 15:33:37 -07:00
Justin Tahara
dcd2cad6b4 fix(infra): Increment Helm Version for Projects (#5512) 2025-09-26 13:59:27 -07:00
Chris Weaver
189f4bb071 fix: add bitbucket env vars (#5510) 2025-09-26 12:38:59 -07:00
SubashMohan
7eeab8fb80 feat(projects): add project creation and management (#5248)
Co-authored-by: Weves <chrisweaver101@gmail.com>
2025-09-26 12:05:20 -07:00
Justin Tahara
60f83dd0db fix(gmail): Skip over emails that don't have gmail enabled (#5506) 2025-09-25 19:57:47 -07:00
Jessica Singh
2618602fd6 fix(source filter): dark mode support (#5505) 2025-09-25 18:10:48 -07:00
Chris Weaver
b80f96de85 fix: LlmPopover after filling in an initial model (#5504) 2025-09-25 17:09:22 -07:00
edwin-onyx
74a15b2c01 fix(infra): fix some dependency hells and add some lazy loading to reduce celery worker RAM usage (#5478)
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-09-25 16:12:26 -07:00
Jessica Singh
408b80ce51 feat(source selection): adding source selection for internal search in chat (#5455) 2025-09-25 16:12:02 -07:00
Wenxi
e82b68c1b0 fix: update seeded docs connector name (#5502) 2025-09-25 15:58:54 -07:00
Justin Tahara
af5eec648b fix(playwright): Add new fix for Playwright test (#5503) 2025-09-25 15:34:24 -07:00
Chris Weaver
d186c5e82e feat(docker): Add DEV_MODE flag for exposing service ports (#5499)
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: justin-tahara <justintahara@gmail.com>
2025-09-25 15:08:20 -07:00
Justin Tahara
4420a50aed fix(github): Revert cache being turned off (#5487) 2025-09-25 14:07:58 -07:00
Justin Tahara
9caa6ea7ff feat(infra): Default to HPA w/ KEDA option (#5480) 2025-09-25 11:58:19 -07:00
Yuhong Sun
8d7b217d33 Deployment README (#5496) 2025-09-25 11:34:30 -07:00
Yuhong Sun
57908769f1 Port 80 (#5495) 2025-09-25 11:10:41 -07:00
Yuhong Sun
600cec7c89 Robust Install (#5494) 2025-09-25 10:08:52 -07:00
Yuhong Sun
bb8ea536c4 Update README.md (#5492) 2025-09-25 09:05:50 -07:00
Yuhong Sun
f97869b91e README (#5486) 2025-09-24 20:33:36 -07:00
Justin Tahara
aa5be56884 fix(github): Remove the Backport workflow (#5484) 2025-09-24 19:33:18 -07:00
Justin Tahara
7580178c95 fix(github): Fix Integration Tests (#5485) 2025-09-24 19:30:07 -07:00
Yuhong Sun
2e0bc8caf0 feat: Easy Install (#5461) 2025-09-24 15:31:45 -07:00
Chris Weaver
f9bd03c7f0 refactor: change venv activation (#5463) 2025-09-23 16:07:46 -07:00
Jessica Singh
77466e1f2b feat(slack bot): add federated search (#5275)
Co-authored-by: Jessica Singh <jessicasingh@Mac.attlocal.net>
Co-authored-by: Jessica Singh <jessicasingh@mac.lan>
2025-09-22 19:19:44 -07:00
Justin Tahara
8dd79345ed fix(sharepoint): Add secondary filter for embedded images (#5473) 2025-09-22 18:48:47 -07:00
Justin Tahara
a049835c49 fix(processing): Mime types for Image Summarization (#5471) 2025-09-22 18:48:31 -07:00
Yuhong Sun
d186d8e8ed Remove incredibly strict password reqs (#5470) 2025-09-22 17:25:34 -07:00
Yuhong Sun
082897eb9b Fix Toggles (#5469) 2025-09-22 17:09:43 -07:00
Yuhong Sun
e38f79dec5 Remove confusing text (#5468) 2025-09-22 15:37:54 -07:00
SubashMohan
26e7bba25d Fix/connector page stack depth limit (#5417) 2025-09-22 19:23:53 +05:30
edwin-onyx
3cde4ef77f fix(infra): create pre commit script and port vertex as lazy import (#5453)
Co-authored-by: Claude <noreply@anthropic.com>
2025-09-21 20:43:28 -07:00
Evan Lohn
f4d135d710 fix: sharepoint memory via excel parsing (#5444) 2025-09-19 17:10:27 -07:00
Richard Guan
6094f70ac8 fix: braintrust masking was over truncating (#5458)
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-09-19 14:10:29 -07:00
Richard Guan
a90e58b39b feat: braintrust tracing (#5450) 2025-09-18 18:17:38 -07:00
Evan Lohn
e82e3141ed feat: zendesk rate limiting (#5452) 2025-09-18 16:35:48 -07:00
edwin-onyx
f8e9060bab fix(infra): remove transformers dependency for api server (#5441)
Co-authored-by: Edwin Luo <edwinluo@3ef5a334-3d74-4dbf-b1c8-d57dc87d5638.attlocal.net>
Co-authored-by: Claude <noreply@anthropic.com>
2025-09-18 12:59:12 -07:00
Jessica Singh
24831fa1a1 fix(slack): swapped checkpoint index (#5427) 2025-09-18 11:09:31 -07:00
edwin-onyx
f6a0e69b2a fix(infra): remove setfit dependency from api server (#5449) 2025-09-17 23:48:47 -07:00
Richard Guan
0394eaea7f fix: copy over tests/__init__.py on docker build (#5443) 2025-09-17 17:12:03 -07:00
Wenxi
898b8c316e feat: docs link on connector creation (#5447) 2025-09-17 17:06:35 -07:00
Chris Weaver
4b0c6d1e54 fix: image gen tool causing error (#5445) 2025-09-17 16:39:54 -07:00
Justin Tahara
da7dc33afa fix(Federated Slack): Persist Document Set for Federated Connectors (#5442) 2025-09-17 13:52:11 -07:00
Richard Guan
c558732ddd feat: eval pipeline (#5369) 2025-09-17 12:17:14 -07:00
Chris Weaver
339ad9189b fix: slackbot error (#5430) 2025-09-16 23:25:34 -07:00
Richard Guan
32d5e408b8 fix: HF Cache Warmup Fix and Celery Pool Management (#5435) 2025-09-16 18:57:52 -07:00
Justin Tahara
14ead457d9 fix(infra): Update chart releaser (#5434) 2025-09-16 16:58:43 -07:00
Justin Tahara
458cd7e832 fix(infra): Add KEDA Dependency (#5433) 2025-09-16 16:52:24 -07:00
Justin Tahara
770a2692e9 Revert "fix(infra): Add KEDA Dependency" (#5432) 2025-09-16 16:48:18 -07:00
Justin Tahara
5dd99b6acf fix(infra): Add KEDA Dependency (#5431) 2025-09-16 16:45:41 -07:00
Chris Weaver
6c7eb89374 fix: remove credential file log (#5429) 2025-09-16 15:48:33 -07:00
eric-zadara
fd11c16c6d feat(infra): Decouple helm chart from bitnami (#5200)
Co-authored-by: eric-zadara <eric-zadara@users.noreply.github.com>
2025-09-16 14:56:04 -07:00
Chris Weaver
11ec603c37 fix: Improve datetime replacement (#5425)
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-09-16 14:49:06 -07:00
Justin Tahara
495d4cac44 feat(infra): Migrate from HPA to KEDA for all Services (#5370) 2025-09-16 13:56:50 -07:00
Wenxi Onyx
fd2d74ae2e onyx mcp server 2025-09-16 11:06:00 -07:00
Evan Lohn
4c7a2e486b fix: skip huge files on sdk fallback (#5421) 2025-09-15 18:24:06 -07:00
Chris Weaver
01e0ba6270 fix: tool seeding migration (#5422) 2025-09-15 16:46:01 -07:00
Wenxi
227dfc4a05 fix: skip excluded img files in sharepoint (#5418) 2025-09-15 11:30:19 -07:00
Chris Weaver
c3702b76b6 docs: add agent files (#5412) 2025-09-14 20:07:18 -07:00
Chris Weaver
bb239d574c feat: single default assistant (#5351) 2025-09-14 20:05:33 -07:00
Chris Weaver
172e5f0e24 feat: Move reg IT to parallel + blacksmith and have MIT only run on merge q… (#5413) 2025-09-13 17:33:45 -07:00
Nils
26b026fb88 SharePoint Connector Fix - Nested Subfolder Indexing (#5404)
Co-authored-by: nsklei <nils.kleinrahm@pledoc.de>
2025-09-13 11:33:01 +00:00
joachim-danswer
870629e8a9 fix: Azure adjustment (#5410) 2025-09-13 00:03:37 +00:00
danielkravets
a547112321 feat: bitbucket connector (#5294) 2025-09-12 18:15:09 -07:00
joachim-danswer
da5a94815e fix: initial response quality, particularly for General assistant (#5399) 2025-09-12 00:14:49 -07:00
Jessica Singh
e024472b74 fix(federated-slack): pass in valid query (#5402) 2025-09-11 19:27:43 -07:00
Chris Weaver
e74855e633 feat: use private registry (#5401) 2025-09-11 18:20:56 -07:00
Justin Tahara
e4c26a933d fix(infra): Fix helm test timeout (#5386) 2025-09-11 18:19:07 -07:00
Chris Weaver
36c96f2d98 fix: playwright (#5396) 2025-09-11 14:06:03 -07:00
Justin Tahara
1ea94dcd8d fix(security): Remove Hard Fail from Trivy (#5394) 2025-09-11 10:35:26 -07:00
Wenxi
2b1c5a0755 fix: remove unneeded dependency from requirements (#5390) 2025-09-10 21:49:02 -07:00
Chris Weaver
82b5f806ab feat: Improve migration (#5391) 2025-09-10 19:29:11 -07:00
Chris Weaver
6340c517d1 fix: missing connectors section (#5387)
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-09-10 19:28:56 -07:00
joachim-danswer
3baae2d4f0 fix: tf/dr flow improvements (#5380) 2025-09-10 16:39:19 -07:00
Chris Weaver
d7c223ddd4 feat: playwright test speed improvement (#5388) 2025-09-10 16:19:56 -07:00
Chris Weaver
df4917243b fix: parallelized IT (#5389) 2025-09-10 14:37:36 -07:00
Justin Tahara
a79ab713ce feat(infra): Adding retry to Trivy tests (#5383) 2025-09-10 14:13:58 -07:00
Chris Weaver
d1f7cee959 feat: parallelized integration tests (#5021)
Co-authored-by: Claude <noreply@anthropic.com>
2025-09-10 12:15:02 -07:00
Justin Tahara
a3f41e20da feat(infra): Add Node Selector option to all Templates (#5384) 2025-09-10 10:23:54 -07:00
Chris Weaver
458ed93da0 feat: remove prompt table (#5348) 2025-09-10 10:21:57 -07:00
Chris Weaver
273d073bd7 fix: non-image gen models (#5381) 2025-09-09 15:52:03 -07:00
Wenxi
9455c8e5ae fix: add back reverted changes to readme (#5377) 2025-09-09 10:23:33 -07:00
Justin Tahara
d45d4389a0 Revert "fix: update contribution guide" (#5376)
Co-authored-by: Wenxi <wenxi@onyx.app>
2025-09-09 09:37:16 -07:00
Chris Weaver
bd901c0da1 fix: playwright tests (#5372)
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-09-09 00:29:52 -07:00
Wenxi
2192605c95 feat: Bedrock API Keys & filter available models (#5343)
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-09-08 18:50:04 -07:00
Wenxi
d248d2f4e9 refactor: update seeded docs (#5364)
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-09-08 18:06:29 -07:00
Chris Weaver
331c53871a fix: image gen display (#5367) 2025-09-08 17:47:17 -07:00
SubashMohan
f62d0d9144 feat(admin/connectors): Disable Auto Sync for unsupported auth; add disabled dropdown + tooltip (#5358) 2025-09-08 21:39:47 +00:00
Chris Weaver
427945e757 fix: model server build (#5362) 2025-09-08 14:00:33 -07:00
Wenxi
e55cdc6250 fix: new docs links (#5363) 2025-09-08 13:49:19 -07:00
sktbcpraha
6a01db9ff2 fix: IMAP - mail processing fixes (#5360) 2025-09-08 12:11:09 -07:00
Richard Guan
82e9df5c22 fix: various bug bash improvements (#5330) 2025-09-07 23:17:01 -07:00
Chris Weaver
16c2ef2852 feat: Make usage report gen a background job (#5342) 2025-09-07 14:44:40 -07:00
Edwin Luo
224a70eea9 fix: update contribution guide (#5354) 2025-09-07 13:06:37 -07:00
Chris Weaver
c457982120 fix: connector tests (#5353) 2025-09-07 11:57:34 -07:00
Chris Weaver
0649748da2 fix: playwright tests (#5352) 2025-09-07 11:24:26 -07:00
Wenxi
ddceddaa28 chore: bump litellm to fix self-hosted inference (#5349) 2025-09-06 19:29:26 -07:00
Evan Lohn
c6733a5026 fix: handle new error type (#5345) 2025-09-06 18:26:54 -07:00
Wenxi
7db744a5de refactor: simplify sharepoint document extraction (#5341) 2025-09-06 20:17:33 +00:00
Chris Weaver
cd2a8b0def Fix mypy (#5347) 2025-09-05 23:28:35 -07:00
Richard Guan
f15bc26cd6 fix: deep research and thoughtful assistant message context and trace all llm calls in langsmith (#5344)
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-09-05 23:13:45 -07:00
Chris Weaver
65f35f0293 fix: whitelabeling (#5346) 2025-09-05 21:06:44 -07:00
joachim-danswer
4e3e608249 fix: Tweaks to Deep Research and some KG adjustments (#5305) 2025-09-06 00:57:26 +00:00
Richard Guan
719a092a12 fix: web search bugs [DAN-2351] (#5281) 2025-09-05 19:50:59 +00:00
wichmann-git
6a8fde7eb1 fix(teams): sanitize None displayName to 'Unknown User' before parsing (#5322) 2025-09-05 10:21:26 -07:00
Justin Tahara
4fdd0812a0 fix(admin): Block access to Custom Analytics Page (#5319)
Co-authored-by: Wenxi <wenxi@onyx.app>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-09-05 10:02:31 -07:00
Wenxi
4913dc1e85 fix: skip large sharepoint files (#5338) 2025-09-05 06:47:30 +00:00
Chris Weaver
4a43a9642e fix: citations endpoint (#5336) 2025-09-04 21:51:52 -07:00
Evan Lohn
cc48a0c38e fix: jira cloud api v3 (#5337) 2025-09-04 21:50:35 -07:00
Chris Weaver
01ccfd2df7 fix: Try to avoid timeouts on image gen (#5316) 2025-09-04 16:14:28 -07:00
Wenxi
36d75786ee fix: honor freshdesk 429 (#5334) 2025-09-04 12:06:19 -07:00
Chris Weaver
f9bc38ba65 fix: Add back MCP (#5333) 2025-09-04 11:13:38 -07:00
Chris Weaver
3da283221d feat: Re-enable sentry (#5329) 2025-09-03 19:07:37 -07:00
Wenxi
90568d3bbb refactor: remove option to exclude citations from assistants (#5320) 2025-09-03 17:32:18 -07:00
Wenxi
7955ca938c fix: freshdesk password and rate limits (#5325) 2025-09-03 17:32:00 -07:00
Chris Weaver
f5d357eb28 fix: old send-message (#5328) 2025-09-03 16:25:38 -07:00
Evan Lohn
d83f616214 fix: incorrect assumptions about fields (#5324) 2025-09-03 21:58:46 +00:00
Chris Weaver
275c1bec3d fix: adjust search tool display (#5317) 2025-09-02 16:44:23 -07:00
Wenxi
7d1ef912e8 fix: allow chats to be moved out of chat groups (#5315)
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-09-02 21:35:55 +00:00
Wenxi
2fe1d4c373 fix: better tool tips (#5314) 2025-09-02 19:39:10 +00:00
SubashMohan
2396ad309e fix: enhance SharePoint connector error handling and content retrieval (#5302) 2025-09-02 08:57:30 -07:00
Wenxi
0b13ef963a fix: allow web and file to show in results (#5290)
* allow web and file to show in results

* don't lag on backspacing 2nd char
2025-09-01 21:53:09 -07:00
Justin Tahara
83073f3ded fix(infra): Add Playwright Directory (#5313) 2025-09-01 19:35:48 -07:00
Wenxi
439a27a775 scroll forms on invalid submit (#5310) 2025-09-01 15:33:52 -07:00
Justin Tahara
91773a4789 fix(jira): Upgrade the Jira Python Version (#5309) 2025-09-01 15:33:03 -07:00
Chris Weaver
185beca648 Small center bar improvements (#5306) 2025-09-01 13:32:56 -07:00
Justin Tahara
2dc564c8df feat(infra): Add IAM support for Redis (#5267)
* feat: JIRA support for custom JQL filter (#5164)

* jira jql support

* jira jql fixes

* Address comment

---------

Co-authored-by: sktbcpraha <131408565+sktbcpraha@users.noreply.github.com>
2025-09-01 10:52:28 -07:00
Chris Weaver
b259f53972 Remove console-log (#5304) 2025-09-01 10:18:39 -07:00
Chris Weaver
f8beb08e2f Fix web build (#5303) 2025-09-01 10:18:06 -07:00
Evan Lohn
83c88c7cf6 feat: mcp client1 (#5271)
* working mcp implementation v1

* attempt openapi fix

* fastmcp
2025-09-01 09:52:35 -07:00
Chris Weaver
2372dd40e0 fix: small formatting fixes (#5300)
* Small formatting fixes

* Fix mypy

* reorder imports
2025-08-31 23:19:22 -07:00
Chris Weaver
5cb6bafe81 Cleanup on ChatPage/MessagesDisply (#5299) 2025-08-31 21:29:17 -07:00
Mohamed Mathari
a0309b31c7 feat: Add Outline Connector (#5284)
* Outline

* fixConnector

* fixTest

* The date filtering is implemented correctly as client-side filtering, which is the only way to achieve it with the Outline API since it doesn't support date parameters natively.

* Update web/src/lib/connectors/connectors.tsx

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

* no connector config for outline

* Update backend/onyx/connectors/outline/client.py

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

* Fix all PR review issues: document ID prefixes, error handling, test assertions, and null guards

* Update backend/onyx/connectors/outline/client.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* The test no longer depends on external network connectivity to httpbin.org

* I've enhanced the OutlineApiClient.post() method in backend/onyx/connectors/outline/client.py to properly handle network-level exceptions that could crash the connector during synchronization:

* Polling mechanism

* Removed flag-based approach

* commentOnClasses

* commentOnClasses

* commentOnClasses

* responseStatus

* startBound

* Changed the method signature to match the interface

* ConnectorMissingCredentials

* Time Out shared config

* Missing Credential message

---------

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-08-31 20:56:10 -07:00
Chris Weaver
0fd268dba7 fix: message render performance (#5297)
* Separate out message display into its own component

* Memoize AIMessage

* Cleanup

* Remove log

* Address greptile/cubic comments
2025-08-31 19:49:53 -07:00
Wenxi
f345da7487 fix: make radios and checkboxes actually clickable (#5298)
* don't nest labels, use htmlfor, fix slackbot form bug

* fix playwright tests for improved labels
2025-08-31 19:16:25 -07:00
Chris Weaver
f2dacf03f1 fix: Chat Page performance improvements (#5295)
* CC performance improvements r2

* More misc chat performance improvements

* Remove unused import

* Remove silly useMemo

* Fix small shift

* Address greptile + cubic + subash comments

* Fix build

* Improve document sidebar

* Remove console.log

* Remove more logs

* Fix build
2025-08-31 14:29:03 -07:00
Wenxi
e0fef50cf0 fix: don't skip ccpairs if embedding swap in progress (#5189)
* don't skip ccpairs if embedding swap in progress

* refactor check_for_indexing to properly handle search setting swaps

* mypy

* mypy

* comment debugging log

* nits and more efficient active index attempt check
2025-08-29 17:17:36 -07:00
Chris Weaver
6ba3eeefa5 feat: center bar + tool force + tool disable (#5272)
* Exploration

* Adding user-specific assistant preferences

* Small fixes

* Improvements

* Reset forced tools upon switch

* Add starter messages

* Improve starter messages

* Add missing file

* cleanup renaming imports

* Address greptile/cubic comments

* Fix build

* Add back assistant info

* Fix image icon

* rebase fix

* Color corrections

* Small tweak

* More color correction

* Remove animation for now

* fix test

* Fix coloring + allow only one forced tool
2025-08-29 17:17:09 -07:00
Richard Guan
aa158abaa9 . (#5286) 2025-08-29 17:07:50 -07:00
Wenxi
255c2af1d6 feat: reorganize connectors pages (#5186)
* Add popular connectors sections and cleanup connectors page

* Add other connectors env var

* other connectors env var to vscode env template

* update playwright tests

* sort by popularity

* recategorize and sort by popularity
2025-08-29 16:59:00 -07:00
Chris Weaver
9ece3b0310 fix: improve index attempts API (#5287)
* Improve index attempts API

* Fix import
2025-08-29 16:15:58 -07:00
joachim-danswer
9e3aca03a7 fix: various dr issues and comments (#5280)
* replacement of "message_delta" etc as Enums + removal

* prompt changes

* cubic fixes where appropriate

* schema fixes + citation symbols

* various fixes

* fix for kg context in new search

* cw comments

* updates
2025-08-29 15:08:23 -07:00
Wenxi
dbd5d4d8f1 fix: allow jira api v3 (#5285)
* allow jira api v3

* don't rely on api version for parsing issues and separate cloud and dc versions
2025-08-29 14:02:01 -07:00
Chris Weaver
cdb97c3ce4 fix: test_soft_delete_chat_session (#5283)
* Fix test_soft_delete_chat_session

* Fix flakiness
2025-08-29 09:01:55 -07:00
Chris Weaver
f30ced31a9 fix: IT (#5276)
* Fix IT

* test

* Fix test

* test

* fix

* Fix test
2025-08-28 20:42:14 -07:00
Wenxi
6cc6c43234 fix: explain why limit=None is appropriate for discord (#5278)
* explain why limit=None is appropriate for discord

* linting
2025-08-28 14:17:46 -07:00
Wenxi
224d934cf4 fix: ruff complaint about type comparison (#5279)
* ruff complaint about type comparison

* ruff complaint type comparison
2025-08-28 14:17:30 -07:00
Nigel Brown
8ecdc61ad3 fix: Explicitly add limit to the function calls (#5273)
* Explicitly add limit to the function calls
This means we miss fewer messages. The default limit is 100.

Signed-off-by: nigel brown <nigel@stacklok.com>

* Update backend/onyx/connectors/discord/connector.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

---------

Signed-off-by: nigel brown <nigel@stacklok.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-08-28 13:35:02 -07:00
Chris Weaver
08161db7ea fix: playwright tests (#5259)
* Fix playwright tests

* Address comment

* Fix
2025-08-27 23:23:55 -07:00
Richard Guan
b139764631 feat: Fast internet search (#5238)
* squash: combine all DR commits into one

Co-authored-by: Joachim Rahmfeld <joachim@onyx.app>
Co-authored-by: Rei Meguro <rmeguro@umich.edu>

* Fixes

* show KG in Assistant only if available

* KG only usable for KG Beta (for now)

* base file upload

* improvements

* raise error if uploaded context is too long

* More improvements

* Fix citations

* jank implementation of internet search with deep research that can kind of work

* early implementation for google api support

---------

Co-authored-by: Weves <chrisweaver101@gmail.com>
Co-authored-by: Joachim Rahmfeld <joachim@onyx.app>
Co-authored-by: Rei Meguro <rmeguro@umich.edu>
Co-authored-by: joachim-danswer <joachim@danswer.ai>
2025-08-27 20:03:02 -07:00
joachim-danswer
2b23dbde8d fix: small DR/Thoughtful mode fixes (#5269)
* fix budget calculation

* Internal custom tool fix + Okta special casing

* nits

* CW comments
2025-08-26 22:33:54 -07:00
Wenxi
2dec009d63 feat: add api/versions to onyx (#5268)
* add api/versions to onyx

* add test and rename onyx

* cubic nit

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

* move api version constants and add explanatory comment

---------

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-08-26 18:14:54 -07:00
Chris Weaver
91eadae353 Fix logger startup (#5263) 2025-08-26 17:33:25 -07:00
Wenxi
8bff616e27 fix: clarify jql instructions and embed links (#5264)
* clarify jql instructions and embed links

* typo

* lint

* fix unit test
2025-08-26 17:27:07 -07:00
sktbcpraha
2c049e170f feat: JIRA support for custom JQL filter (#5164)
* jira jql support

* jira jql fixes
2025-08-26 12:44:39 -07:00
Oht8wooWi8yait9n
23e6d7ef3c Update gemini model names. (#5262)
Co-authored-by: Aaron Sells <aaron.b.sells@nasa.gov>
2025-08-26 12:33:02 -07:00
Chris Weaver
ed81e75edd fix: add jira auto-sync option in UI (#5260)
* Add jira auto-sync option in UI

* Fix build
2025-08-26 11:21:04 -07:00
Wenxi
de22fc3a58 remove dead code (#5261) 2025-08-26 11:14:12 -07:00
Cameron
009b7f60f1 Update date format used for fetching from Bookstack (#5221) 2025-08-26 09:49:38 -07:00
Chris Weaver
9d997e20df feat: frontend refactor + DR (#5225)
* squash: combine all DR commits into one

Co-authored-by: Joachim Rahmfeld <joachim@onyx.app>
Co-authored-by: Rei Meguro <rmeguro@umich.edu>

* Fixes

* show KG in Assistant only if available

* KG only usable for KG Beta (for now)

* base file upload

* raise error if uploaded context is too long

* improvements

* More improvements

* Fix citations

* better decision making

* improved decision-making in Orchestrator

* generic_internal tools

* Small tweak

* tool use improvements

* add on

* More image gen stuff

* fixes

* Small color improvements

* Markdown utils

* fixed end conditions (incl early exit for image generation)

* remove agent search + image fixes

* Okta tool support for reload

* Some cleanup

* Stream back search tool results as they come

* tool forcing

* fixed no-Tool-Assistant

* Support anthropic tool calling

* Support anthropic models better

* More stuff

* prompt fixes and search step numbers

* Fix hook ordering issue

* internal search fix

* Improve citation look

* Small UI improvements

* Improvements

* Improve dot

* Small chat fixes

* Small UI tweaks

* Small improvements

* Remove un-used code

* Fix

* Remove test_answer.py for now

* Fix

* improvements

* Add foreign keys

* early forcing

* Fix tests

* Fix tests

---------

Co-authored-by: Joachim Rahmfeld <joachim@onyx.app>
Co-authored-by: Rei Meguro <rmeguro@umich.edu>
Co-authored-by: joachim-danswer <joachim@danswer.ai>
2025-08-26 00:26:14 -07:00
Denizhan Dakılır
e6423c4541 Handle disabled auth in connector indexing status endpoint (#5256) 2025-08-25 16:42:46 -07:00
Wenxi
cb969ad06a add require_email_verification to values.yaml (#5249) 2025-08-25 22:02:49 +00:00
Sam Waddell
c4076d16b6 fix: update all log paths to reflect change related to non-root user (#5244) 2025-08-25 14:11:18 -07:00
Evan Lohn
04a607a718 ensure multi-tenant contextvar is passed (#5240) 2025-08-25 13:35:50 -07:00
Evan Lohn
c1e1aa9dfd fix: downloads are never larger than 20mb (#5247)
* fix: downloads are never larger than 20mb

* JT comments

* import to fix integration tests
2025-08-25 18:10:14 +00:00
Chris Weaver
1ed7abae6e Small improvement (#5250) 2025-08-25 08:07:36 +05:30
SubashMohan
cf4855822b Perf/indexing status page (#5142)
* indexing status optimization first draft

* refactor: update pagination logic and enhance UI for indexing status table

* add index attempt pruning job and display federated connectors in index status page

* update celery worker command to include index_attempt_cleanup queue

* refactor: enhance indexing status table and remove deprecated components

* mypy fix

* address review comments

* fix pagination reset issue

* add TODO for optimizing connector materialization and performance in future deployments

* enhance connector indexing status retrieval by adding 'get_all_connectors' option and updating pagination logic

* refactor: transition to paginated connector indexing status retrieval and update related components

* fix: initialize latest_index_attempt_docs_indexed to 0 in CCPairIndexingStatusTable component

* feat: add mock connector file support for indexing status retrieval and update indexing_statuses type to Sequence

* mypy fix

* refactor: rename indexing status endpoint to simplify API and update related components
2025-08-24 17:43:47 -07:00
Justin Tahara
e242b1319c fix(infra): Fixed RDS IAM Issue (#5245) 2025-08-22 18:13:12 -07:00
Justin Tahara
eba4b6620e feat(infra): AWS IAM Terraform (#5228)
* feat(infra): AWS IAM Terraform

* Fixing dependency issue

* Fixing more weird logic

* Final cleanup

* one change

* oops
2025-08-22 16:39:16 -07:00
Justin Tahara
3534515e11 feat(infra): Utilize AWS RDS IAM Auth (#5226)
* feat(infra): Utilize AWS RDS IAM Auth

* Update spacing

* Bump helm version
2025-08-21 17:35:53 -07:00
Justin Tahara
5602ff8666 fix: use only celery-shared for security context (#5236) (#5239)
* fix: use only celery-shared for security context

* fix: bump helm chart version 0.2.8

Co-authored-by: Sam Waddell <shwaddell28@gmail.com>
2025-08-21 17:25:06 -07:00
Sam Waddell
2fc70781b4 fix: use only celery-shared for security context (#5236)
* fix: use only celery-shared for security context

* fix: bump helm chart version 0.2.8
2025-08-21 14:15:07 -07:00
Justin Tahara
f76b4dec4c feat(infra): Ignoring local Terraform files (#5227)
* feat(infra): Ignoring local Terraform files

* Addressing some comments
2025-08-21 09:43:18 -07:00
Jessica Singh
a5a516fa8a refactor(model): move api-based embeddings/reranking calls out of model server (#5216)
* move api-based embeddings/reranking calls to api server out of model server, added/modified unit tests

* ran pre-commit

* fix mypy errors

* mypy and precommit

* move utils to right place and add requirements

* precommit check

* removed extra constants, changed error msg

* Update backend/onyx/utils/search_nlp_models_utils.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* greptile

* addressed comments

* added code enforcement to throw error

---------

Co-authored-by: Jessica Singh <jessicasingh@Mac.attlocal.net>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-08-20 21:50:21 +00:00
Sam Waddell
811a198134 docs: add non-root user info (#5224) 2025-08-20 13:50:10 -07:00
Sam Waddell
5867ab1d7d feat: add non-root user to backend and model-server images (#5134)
* feat: add non-root user to backend and model-server image

* feat: update values to support security context for index, inference, and celery_shared

* feat: add security context support for index and inference

* feat: add celery_shared security context support to celery worker templates

* fix: cache management strategy

* fix: update deployment files for volume mount

* fix: address comments

* fix: bump helm chart version for new security context template changes

* fix: bump helm chart version for new security context template changes

* feat: move useradd earlier in build for reduced image size

---------

Co-authored-by: Phil Critchfield <phil.critchfield@liatrio.com>
2025-08-20 13:49:50 -07:00
Jose Bañez
dd6653eb1f fix(connector): #5178 Add error handling and logging for empty answer text in Loopio Connector (#5179)
* fix(connector): #5178 Add error handling and logging for empty answer text in LoopioConnector

* fix(connector): onyx-dot-app#5178:  Improve handling of empty answer text in LoopioConnector

---------

Co-authored-by: Jose Bañez <jose@4gclinical.com>
2025-08-20 09:14:08 -07:00
Richard Guan
db457ef432 fix(admin): [DAN-2202] Remove users from invited users after accept (#5214)
---------

Co-authored-by: Richard Guan <richardguan@Richards-MacBook-Pro.local>
Co-authored-by: Richard Guan <richardguan@Mac.attlocal.net>
2025-08-20 03:55:02 +00:00
Richard Guan
de7fe939b2 . (#5212)
Co-authored-by: Richard Guan <richardguan@Richards-MBP.lan>
2025-08-20 02:36:44 +00:00
Chris Weaver
38114d9542 fix: PDF file upload (#5218)
* Fix / improve file upload

* Address cubic comment
2025-08-19 15:16:08 -07:00
Justin Tahara
32f20f2e2e feat(infra): Add WAF implementation (#5213) (#5217)
* feat(infra): Add WAF implementation

* Addressing greptile comments

* Additional removal of unnecessary code
2025-08-19 13:01:40 -07:00
Justin Tahara
3dd27099f7 feat(infra): Add WAF implementation (#5213)
* feat(infra): Add WAF implementation

* Addressing greptile comments

* Additional removal of unnecessary code
2025-08-18 17:45:50 -07:00
Cameron
91c4d43a80 Move @types packages to devDependencies (#5210) 2025-08-18 14:34:09 -07:00
SubashMohan
a63ba1bb03 fix: sharepoint group not found error and url with apostrophe (#5208)
* fix: handle ClientRequestException in SharePoint permission utils and connector

* feat: enhance SharePoint permission utilities with logging and URL handling

* greptile typo fix

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* enhance group sync handling for public groups

---------

Co-authored-by: Wenxi <wenxi@onyx.app>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-08-18 17:12:59 +00:00
Evan Lohn
7b6189e74c corrected routing (#5202) 2025-08-18 16:07:28 +00:00
Evan Lohn
ba423e5773 fix: model server concurrency (#5206)
* fix: model server race cond

* fix async

* different approach
2025-08-18 16:07:16 +00:00
SubashMohan
fe029eccae chore: add SharePoint sync environment variables to integration test (#5197)
* chore: add SharePoint sync environment variables to integration test workflows

* fix cubic comments

* test: skip SharePoint permission tests for non-enterprise

* test: update SharePoint permission tests to skip for non-enterprise environments
2025-08-18 03:21:04 +00:00
Wenxi
ea72af7698 fix sharepoint tests (#5209) 2025-08-17 22:25:47 +00:00
Wenxi
17abf85533 fix unpaused user files (#5205) 2025-08-16 01:39:16 +00:00
Wenxi
3bd162acb9 fix: sharepoint tests and indexing logic (#5204)
* don't index onedrive personal sites in sharepoint

* fix sharepoint tests and indexing behavior

* remove print
2025-08-15 18:19:42 -07:00
Evan Lohn
664ce441eb generous timeout between docfetching finishing and docprocessing starting (#5201) 2025-08-15 15:43:01 -07:00
Wenxi
6863fbee54 fix: validate sharepoint connector with validate_connector_settings (#5199)
* validate sharepoint connector with validate_connector_settings

* fix test

* fix tests
2025-08-15 00:38:31 +00:00
Justin Tahara
bb98088b80 fix(infra): Fix Helm Chart Test (#5198) 2025-08-14 23:28:17 +00:00
Justin Tahara
ce8cb1112a feat(infra): Adding new AWS Terraform Template Code (#5194)
* feat(infra): Adding new AWS Terraform Template Code

* Addressing greptile comments

* Applying some updates after the cubic reviews as well

* Adding one detail

* Removing unused var

* Addressing more cubic comments
2025-08-14 16:47:15 -07:00
Nils
a605bd4ca4 feat: make sharepoint documents and sharepoint pages optional (#5183)
* feat: make sharepoint documents and sharepoint pages optional

* fix: address review feedback for PR #5183

* fix: exclude personal sites from sharepoint connector

---------

Co-authored-by: Nils Kleinrahm <nils.kleinrahm@pledoc.de>
2025-08-14 15:17:23 -07:00
Dominic Feliton
0e8b5af619 fix(connector): user file helm start cmd + legacy file connector incompatibility (#5195)
* Fix user file helm start cmd + legacy file connector incompatibility

* typo

* remove unnecessary logic

* undo

* make recommended changes

* keep comment

* cleanup

* format

---------

Co-authored-by: Dominic Feliton <37809476+dominicfeliton@users.noreply.github.com>
2025-08-14 13:20:19 -07:00
SubashMohan
46f3af4f68 enhance file processing with content type handling (#5196) 2025-08-14 08:59:53 +00:00
Evan Lohn
2af64ebf4c fix: ensure exception strings don't get swallowed (#5192)
* ensure exception strings don't get swallowed

* just send exception code
2025-08-13 20:05:16 +00:00
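A common way exception strings get swallowed is re-raising a fresh exception without the original message. A generic illustration of preserving it via exception chaining (illustrative only; per the second bullet, the final fix sends just an exception code):

```python
def process(payload: dict) -> str:
    try:
        return payload["key"].upper()
    except (KeyError, AttributeError) as e:
        # repr(e) keeps the exception type and message; "from e" chains the
        # original traceback instead of silently discarding it
        raise RuntimeError(f"failed to process payload: {e!r}") from e
```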
Evan Lohn
0eb1824158 fix: sf connector docs (#5171)
* fix: sf connector docs

* more sf logs

* better logs and new attempt

* add fields to error temporarily

* fix sf

---------

Co-authored-by: Wenxi <wenxi@onyx.app>
2025-08-13 17:52:32 +00:00
Chris Weaver
e0a9a6fb66 feat: okta profile tool (#5184)
* Initial Okta profile tool

* Improve

* Fix

* Improve

* Improve

* Address EL comments
2025-08-13 09:57:31 -07:00
Wenxi
fe194076c2 make default personas hideable (#5190) 2025-08-13 01:12:51 +00:00
Wenxi
55dc24fd27 fix: seeded total doc count (#5188)
* fix seeded total doc count

* fix seeded total doc count
2025-08-13 00:19:06 +00:00
Evan Lohn
da02962a67 fix: thread safe approach to docprocessing logging (#5185)
* thread safe approach to docprocessing logging

* unify approaches

* reset
2025-08-12 02:25:47 +00:00
SubashMohan
9bc62cc803 feat: sharepoint perm sync (#5033)
* sharepoint perm sync first draft

* feat: Implement SharePoint permission synchronization

* mypy fix

* remove commented code

* bot comments fixes and job failure fixes

* introduce generic way to upload certificates in credentials

* mypy fix

* add checkpointing to sharepoint connector

* add sharepoint integration tests

* Refactor SharePoint connector to derive tenant domain from verified domains and remove direct tenant domain input from credentials

* address review comments

* add permission sync to site pages

* mypy fix

* fix tests error

* fix tests and address comments

* Update file extraction behavior in SharePoint connector to continue processing on unprocessable files
2025-08-11 16:59:16 +00:00
Evan Lohn
bf6705a9a5 fix: max tokens param (#5174)
* max tokens param

* fix unit test

* fix unit test
2025-08-11 09:57:44 -07:00
Rei Meguro
df2fef3383 fix: removal of old tags + is_list differentiation (#5147)
* initial migration

* getting metadata from tags

* complete migration

* migration override for cloud

* fix: more robust structured tag gen

* tag and indexing update

* fix: move is_list to tags

* migration rebase

* test cases + bugfix on unique constraint

* fix logging
2025-08-10 22:39:33 +00:00
SubashMohan
8cec3448d7 fix: restrict user file access to current user only (#5177)
* fix: restrict user file access to current user only

* fix: enhance user file access control for recent folder
2025-08-10 19:00:18 +00:00
Justin Tahara
b81687995e fix(infra): Removing invalid helm version (#5176) 2025-08-08 18:40:55 -07:00
Justin Tahara
87c2253451 fix(infra): Update github workflow to not tag latest (#5172)
* fix(infra): Update github workflow to not tag latest

* Cleaned up the code a bit
2025-08-08 23:23:55 +00:00
Wenxi
297c2957b4 add gpt 5 display names (#5175) 2025-08-08 16:58:47 -07:00
Wenxi
bacee0d09d fix: sanitize slack payload before logging (#5167)
* sanitize slack payload before logging

* nit
2025-08-08 02:10:00 +00:00
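"Sanitize" here presumably means redacting secret-bearing fields before the payload reaches the logs. A hedged sketch of that idea (the key names are illustrative, not Slack's or Onyx's exact schema):

```python
from typing import Any

SENSITIVE_KEYS = {"token", "api_app_id", "authorization"}  # illustrative

def sanitize_payload(payload: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the payload with sensitive values masked."""
    cleaned: dict[str, Any] = {}
    for key, value in payload.items():
        if key.lower() in SENSITIVE_KEYS:
            cleaned[key] = "***REDACTED***"
        elif isinstance(value, dict):
            cleaned[key] = sanitize_payload(value)  # recurse into nested dicts
        else:
            cleaned[key] = value
    return cleaned
```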
Evan Lohn
297720c132 refactor: file processing (#5136)
* file processing refactor

* mypy

* CW comments

* address CW
2025-08-08 00:34:35 +00:00
Evan Lohn
bd4bd00cef feat: office parsing markitdown (#5115)
* switch to markitdown untested

* passing tests

* reset file

* dotenv version

* docs

* add test file

* add doc

* fix integration test
2025-08-07 23:26:02 +00:00
Chris Weaver
07c482f727 Make starter messages visible on smaller screens (#5170) 2025-08-07 16:49:18 -07:00
Wenxi
cf193dee29 feat: support gpt5 models (#5169)
* support gpt5 models

* gpt5mini visible
2025-08-07 12:35:46 -07:00
Evan Lohn
1b47fa2700 fix: remove erroneous error case and add valid error (#5163)
* fix: remove erroneous error case and add valid error

* also address docfetching-docprocessing limbo
2025-08-07 18:17:00 +00:00
Wenxi Onyx
e1a305d18a mask llm api key from logs 2025-08-07 00:01:29 -07:00
Evan Lohn
e2233d22c9 feat: salesforce custom query (#5158)
* WIP merged approach untested

* tested custom configs

* JT comments

* fix unit test

* CW comments

* fix unit test
2025-08-07 02:37:23 +00:00
Justin Tahara
20d1175312 feat(infra): Bump Vespa Helm Version (#5161)
* feat(infra): Bump Vespa Helm Version

* Adding the Chart.lock file
2025-08-06 19:06:18 -07:00
justin-tahara
7117774287 Revert that change. Let's do this properly 2025-08-06 18:54:21 -07:00
justin-tahara
77f2660bb2 feat(infra): Update Vespa Helm Chart Version 2025-08-06 18:53:02 -07:00
Wenxi
1b2f4f3b87 fix: slash command slackbot to respond in private msg (#5151)
* fix slash command slackbot to respond in private msg

* rename confusing variable. fix slash message response in DMs
2025-08-05 19:03:38 -07:00
Evan Lohn
d85b55a9d2 no more scheduled stalling (#5154) 2025-08-05 20:17:44 +00:00
Justin Tahara
e2bae5a2d9 fix(infra): Adding helm directory (#5156)
* feat(infra): Adding helm directory

* one more fix
2025-08-05 14:11:57 -07:00
Justin Tahara
cc9c76c4fb feat(infra): Release Charts on Github Pages (#5155) 2025-08-05 14:03:28 -07:00
Chris Weaver
258e08abcd feat: add customization via env vars for curator role (#5150)
* Add customization via env vars for curator role

* Simplify

* Simplify more

* Address comments
2025-08-05 09:58:36 -07:00
Evan Lohn
67047e42a7 fix: preserve error traces (#5152) 2025-08-05 09:44:55 -07:00
SubashMohan
146628e734 fix unsupported character error in minio migration (#5145)
* fix unsupported character error in minio migration

* slash fix
2025-08-04 12:42:07 -07:00
Wenxi
c1d4b08132 fix: minio file names (#5138)
* nit var clarity

* maintain file names in connector config for display

* remove unused util

* migration draft

* optional file names to not break existing instances

* backwards compatible

* backwards compatible

* migration logging

* update file conn tests

* unnecessary none

* mypy + explanatory comments
2025-08-01 20:31:29 +00:00
Justin Tahara
f3f47d0709 feat(infra): Creating new helm chart action workflow (#5137)
* feat(infra) Creating new helm chart action workflow

* Adding the steps

* Adding in dependencies

* One more debug

* Adding a new step to install helm
2025-08-01 09:26:58 -07:00
Justin Tahara
fe26a1bfcc feat(infra): Codeowner for Helm directory (#5139) 2025-07-31 23:05:46 +00:00
Wenxi
554cd0f891 fix: accept multiple zip types and fallback to extension (#5135)
* accept multiple zip types and fallback to extension

* move zip check to util

* mypy nit
2025-07-30 22:21:16 +00:00
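The fix reads as: recognize the several MIME types clients report for zip archives, and fall back to the file extension when the reported type is unhelpful. A minimal sketch under that assumption (the MIME list is the commonly seen set, not necessarily the exact one used):

```python
ZIP_MIME_TYPES = {
    "application/zip",
    "application/x-zip-compressed",
    "application/zip-compressed",
}

def is_zip_file(content_type: str | None, file_name: str) -> bool:
    """Accept any known zip MIME type; otherwise fall back to the extension,
    which also covers clients that report zips as application/octet-stream."""
    if content_type in ZIP_MIME_TYPES:
        return True
    return file_name.lower().endswith(".zip")
```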
Raunak Bhagat
f87d3e9849 fix: Make ungrounded types have a default name when sending to the frontend (#5133)
* Update names in map-comprehension

* Make default name for ungrounded types public

* Return the default name for ungrounded entity-types

* Update backend/onyx/db/entities.py

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

---------

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-07-30 20:46:30 +00:00
Rei Meguro
72cdada893 edit link to custom actions (#5129) 2025-07-30 15:08:39 +00:00
SubashMohan
c442ebaff6 Feature/GitHub permission sync (#4996)
* github perm sync initial draft

* introduce github doc sync and perm sync

* remove specific start time check

* Refactor GitHub connector to use SlimCheckpointOutputWrapper for improved document handling

* Update GitHub sync frequency defaults from 30 minutes to 5 minutes

* Add stop signal handling and progress reporting in GitHub document sync

* Refactor tests for Confluence and Google Drive connectors to use a mock fetch function for document access

* change the doc_sync approach

* add static typing for document columns and where clause

* remove prefix logic in connector runner

* mypy fix

* code review changes

* mypy fix

* fix review comments

* add sort order

* Implement merge heads migration for Alembic and update Confluence and Google Drive tests

* github unit tests fix

* delete merge head and rebase the docmetadata field migration

---------

Co-authored-by: Subash <subash@onyx.app>
2025-07-30 02:42:18 +00:00
Justin Tahara
56f16d107e feat(infra): Update helm version after new feature (#5120) 2025-07-29 16:31:35 -07:00
Justin Tahara
0157ae099a [Vespa] Update to optimized configuration pt.2 (#5113) 2025-07-28 20:42:31 +00:00
justin-tahara
565fb42457 Let's do this properly 2025-07-28 10:42:31 -07:00
justin-tahara
a50a8b4a12 [Vespa] Update to optimized configuration 2025-07-28 10:38:48 -07:00
Evan Lohn
4baf4e7d96 feat: pruning freq (#5097)
* pruning frequency increase

* add logs
2025-07-26 22:29:43 +00:00
Wenxi
8b7ab2eb66 onyx metadata minio fix + permissive unstructured fail (#5085) 2025-07-25 21:26:02 +00:00
Evan Lohn
1f75f3633e fix: sidebar ranges (#5084) 2025-07-25 19:46:47 +00:00
Evan Lohn
650884d76a fix: preserve error traces (#5083) 2025-07-25 18:56:11 +00:00
Wenxi
8722bdb414 typo (#5082) 2025-07-25 18:26:21 +00:00
Evan Lohn
71037678c3 attempt to fix parsing of tricky template files (#5080) 2025-07-25 02:18:35 +00:00
Chris Weaver
68de1015e1 feat: support aspx files (#5068)
* Support aspx files

* Add fetching of site pages

* Improve

* Small enhancement

* more improvements

* Improvements

* Fix tests
2025-07-24 19:19:24 -07:00
Evan Lohn
e2b3a6e144 fix: drive external links (#5079) 2025-07-24 17:42:12 -07:00
Evan Lohn
4f04b09efa add library to fall back to for tokenizing (#5078) 2025-07-24 11:15:07 -07:00
SubashMohan
5c4f44d258 fix: sharepoint lg files issue (#5065)
* add SharePoint file size threshold check

* Implement retry logic for SharePoint queries to handle rate limiting and server error

* mypy fix

* add content none check

* remove unreachable code from retry logic in sharepoint connector
2025-07-24 14:26:01 +00:00
Evan Lohn
19652ad60e attempt fix for broken excel files (#5071) 2025-07-24 01:21:13 +00:00
Evan Lohn
70c96b6ab3 fix: remove locks from indexing callback (#5070) 2025-07-23 23:05:35 +00:00
Raunak Bhagat
65076b916f refactor: Update location of sidebar (#5067)
* Use props instead of inline type def

* Add new AppProvider

* Remove unused component file

* Move `sessionSidebar` to be inside of `components` instead of `app/chat`

* Change name of `sessionSidebar` to `sidebar`

* Remove `AppModeProvider`

* Fix bug in how the cookies were set
2025-07-23 21:59:34 +00:00
PaulHLiatrio
06bc0e51db fix: adjust template variable from .Chart.AppVersion to .Values.global.version to match versioning pattern. (#5069) 2025-07-23 14:54:32 -07:00
Devin
508b456b40 fix: explicit api_server dependency on minio in docker compose files (#5066) 2025-07-23 13:37:42 -07:00
Evan Lohn
bf1e2a2661 feat: avoid full rerun (#5063)
* fix: remove extra group sync

* second extra task

* minor improvement for non-checkpointed connectors
2025-07-23 18:01:23 +00:00
Evan Lohn
991d5e4203 fix: regen api key (#5064) 2025-07-23 03:36:51 +00:00
Evan Lohn
d21f012b04 fix: remove extra group sync (#5061)
* fix: remove extra group sync

* second extra task
2025-07-22 23:24:42 +00:00
Wenxi
86b7beab01 fix: too many internet chunks (#5060)
* minor internet search env vars

* add limit to internet search chunks

* note

* nits
2025-07-22 23:11:10 +00:00
Evan Lohn
b4eaa81d8b handle empty doc batches (#5058) 2025-07-22 22:35:59 +00:00
Evan Lohn
ff2a4c8723 fix: time discrepancy (#5056)
* fix time discrepancy

* remove log

* remove log
2025-07-22 22:19:02 +00:00
Raunak Bhagat
51027fd259 fix: Make pr-labeler run on edits too 2025-07-22 15:04:37 -07:00
Raunak Bhagat
7e3fd2b12a refactor: Update the error message that is logged when PR title fails Conventional Commits regex (#5062) 2025-07-22 14:46:22 -07:00
Chris Weaver
d2fef6f0b7 Tiny launch.json template improvement (#5055) 2025-07-22 11:15:44 -07:00
Evan Lohn
bd06147d26 feat: connector indexing decoupling (#4893)
* WIP

* renamed and moved tasks (WIP)

* minio migration

* bug fixes and finally add document batch storage

* WIP: can succeed but status is error

* WIP

* import fixes

* working v1 of decoupled

* catastrophe handling

* refactor

* remove unused db session in prep for new approach

* renaming and docstrings (untested)

* renames

* WIP with no more indexing fences

* robustness improvements

* clean up rebase

* migration and salesforce rate limits

* minor tweaks

* test fix

* connector pausing behavior

* correct checkpoint resumption logic

* cleanups in docfetching

* add heartbeat file

* update template jsonc

* deployment fixes

* fix vespa httpx pool

* error handling

* cosmetic fixes

* dumb

* logging improvements and non checkpointed connector fixes

* didn't save

* misc fixes

* fix import

* fix deletion of old files

* add in attempt prefix

* fix attempt prefix

* tiny log improvement

* minor changes

* fixed resumption behavior

* passing int tests

* fix unit test

* fixed unit tests

* trying timeout bump to see if int tests pass

* trying timeout bump to see if int tests pass

* fix autodiscovery

* helm chart fixes

* helm and logging
2025-07-22 03:33:25 +00:00
Raunak Bhagat
1f3cc9ed6e Make from_.user optional (use "Unknown User") if not found (#5051) 2025-07-21 17:50:28 -07:00
Raunak Bhagat
6086d9e51a feat: Updated KG admin page (#5044)
* Update KG admin UI

* Styling changes

* More changes

* Make edits auto-save

* Add more stylings / transitions

* Fix opacity

* Separate out modal into new component

* Revert backend changes

* Update styling

* Add convenience / styling changes to date-picker

* More styling / functional updates to kg admin-page

* Avoid reducing opacity of active-toggle

* Update backend APIs for new KG admin page

* More updates of styling for kg-admin page

* Remove nullability

* Remove console log

* Remove unused imports

* Change type of `children` variable

* Update web/src/app/admin/kg/interfaces.ts

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

* Update web/src/components/CollapsibleCard.tsx

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

* Remove null

* Update web/src/components/CollapsibleCard.tsx

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Force non-null

* Fix failing test

---------

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-07-21 15:37:27 -07:00
Raunak Bhagat
e0de24f64e Remove empty tooltip (#5050) 2025-07-21 12:45:48 -07:00
Rei Meguro
08b6b1f8b3 feat: Search and Answer Quality Test Script (#4974)
* aefads

* search quality tests improvement

Co-authored-by: wenxi-onyx <wenxi@onyx.app>

* nits

* refactor: config refactor

* document context + skip genai fix

* feat: answer eval

* more error messages

* mypy ragas

* mypy

* small fixes

* feat: more metrics

* fix

* feat: grab content

* typing

* feat: lazy updates

* mypy

* all at front

* feat: answer correctness

* use api key so it works with auth enabled

* update readme

* feat: auto add path

* feat: rate limit

* fix: readme + remove rerank all

* fix: raise exception immediately

* docs: improved clarity

* feat: federated handling

* fix: mypy

* nits

---------

Co-authored-by: wenxi-onyx <wenxi@onyx.app>
2025-07-19 01:51:51 +00:00
joachim-danswer
afed1a4b37 feat: KG improvements (#5048)
* improvements

* drop views if SQL fails

* mypy fix
2025-07-18 16:15:11 -07:00
Chris Weaver
bca18cacdf fix: improve assistant fetching efficiency (#5047)
* Improve assistant fetching efficiency

* More fix

* Fix weird build stuff

* Improve
2025-07-18 14:16:10 -07:00
Chris Weaver
335db91803 fix: improve check for indexing status (#5042)
* Improve check_for_indexing + check_for_vespa_sync_task

* Remove unused

* Fix

* Simplify query

* Add more logging

* Address bot comments

* Increase # of tasks generated since we're not going cc-pair by cc-pair

* Only index 50 user files at a time
2025-07-17 23:52:51 -07:00
Chris Weaver
67c488ff1f Improve support for non-default postgres schemas (#5046) 2025-07-17 23:51:39 -07:00
Wenxi
deb7f13962 remove chat session necessity from send message simple api (#5040) 2025-07-17 23:23:46 +00:00
Raunak Bhagat
e2d3d65c60 fix: Move around group-sync tests (since they require docker services to be running) (#5041)
* Move around tests

* Add missing fixtures + change directory structure up some more

* Add env variables
2025-07-17 22:41:31 +00:00
Raunak Bhagat
b78a6834f5 fix: Have document show up before message starts streaming back (#5006)
* Have document show up before message starts streaming back

* Add docs
2025-07-17 10:17:57 -07:00
Raunak Bhagat
4abe90aa2c fix: Fix Confluence pagination (#5035)
* Re-implement pagination

* Add note

* Fix invalid integration test configs

* Fix other failing test

* Edit failing test

* Revert test

* Revert pagination size

* Add comment on yielding style

* Use fixture instead of manually initializing sql-engine

* Fix failing tests

* Move code back and copy-paste
2025-07-17 14:02:29 +00:00
Raunak Bhagat
de9568844b Add PR labeller job (#4611) 2025-07-16 18:28:18 -07:00
Evan Lohn
34268f9806 fix bug in index swap (#5036) 2025-07-16 23:09:17 +00:00
Chris Weaver
ed75678837 Add suggested helm resource limits (#5032)
* Add resource suggestions for helm

* Adjust README

* fix

* fix lint
2025-07-15 15:52:16 -07:00
Chris Weaver
3bb58a3dd3 Persona simplification r2 (#5031)
* Revert "Revert "Reduce amount of stuff we fetch on `/persona` (#4988)" (#5024)"

This reverts commit f7ed7cd3cd.

* Enhancements / fix re-render

* re-arrange

* greptile
2025-07-15 14:51:40 -07:00
Chris Weaver
4b02feef31 Add option to disable my documents (#5020)
* Add option to disable my documents

* cleanup
2025-07-14 23:16:14 -07:00
Chris Weaver
6a4d49f02e More pruning logging (#5027) 2025-07-14 12:55:12 -07:00
Chris Weaver
d1736187d3 Fix full tenant sync (#5026) 2025-07-14 10:56:40 -07:00
Wenxi
0e79b96091 Feature/revised internet search (#4994)
* remove unused pruning config

* add env vars

* internet search date time toggle

* revised internet search supporting multiple providers

* env var

* simplify retries and fix mypy issues

* greptile nits

* more mypy

* please mypy

* mypy final straw

* cursor vs. mypy

* simplify fields from provider results

* type-safe prompt, enum nit, provider enums, indexing/doc processing change

---------

Co-authored-by: Wenxi Onyx <wenxi-onyx@Wenxis-MacBook-Pro.local>
2025-07-14 10:24:03 -07:00
Raunak Bhagat
ae302d473d Fix imap tests 2025-07-14 09:50:33 -07:00
Raunak Bhagat
feca4fda78 feat: Add frontend for email connector (#5008)
* Add basic structure for frontend email connector

* Update names of credentials-json keys

* Fix up configurations workflow

* Edit logic on how `mail_client` is used

- imaplib.IMAP4_SSL is supposed to be treated as an ephemeral object (see the sketch after this entry)

* Edit helper name and add docs

* Fix invalid mailbox selection error

* Implement greptile suggestions

* Make recipients optional and add sender to primary-owners

* Add sender to external-access too; perform dedupe-ing of emails

* Simplify logic
2025-07-14 09:43:36 -07:00
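The "ephemeral" note above matches how `imaplib` is normally used: open a connection, select a mailbox, do the work, and log out, rather than holding one `IMAP4_SSL` object for the connector's lifetime. A minimal sketch under that assumption (host, credentials, and mailbox are placeholders):

```python
import imaplib
from contextlib import contextmanager

@contextmanager
def imap_session(host: str, user: str, password: str):
    """Open a short-lived IMAP session and guarantee logout."""
    client = imaplib.IMAP4_SSL(host)
    try:
        client.login(user, password)
        yield client
    finally:
        client.logout()

# usage: a fresh connection per unit of work, never a long-lived client
with imap_session("imap.example.com", "user@example.com", "app-password") as mail:
    mail.select("INBOX", readonly=True)
    status, data = mail.search(None, "ALL")
```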
Chris Weaver
f7ed7cd3cd Revert "Reduce amount of stuff we fetch on /persona (#4988)" (#5024)
This reverts commit adf48de652.
2025-07-14 09:20:50 -07:00
Chris Weaver
8377ab3ef2 Send over less data for document sets (#5018)
* Send over less data for document sets

* Fix type errors

* Fix tests

* Fixes

* Don't change packages
2025-07-13 22:47:05 +00:00
Chris Weaver
95c23bf870 Add full sync endpoint (#5019)
* Add full sync endpoint

* Update backend/ee/onyx/server/tenants/billing_api.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update backend/ee/onyx/server/tenants/billing_api.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-07-13 13:59:19 -07:00
Chris Weaver
e49fb8f56d Fix pruning (#5017)
* Use better last_pruned time for never pruned connectors

* improved pruning req / refresh freq selections

* Small tweak

* Update web/src/lib/connectors/connectors.tsx

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-07-12 14:46:00 -07:00
Chris Weaver
adf48de652 Reduce amount of stuff we fetch on /persona (#4988)
* claude stuff

* Send over less Assistant data

* more

* Fix build

* Fix mypy

* fix

* small tweak

* Address EL cmments

* fix

* Fix build
2025-07-12 14:15:31 -07:00
Wenxi Onyx
bca2500438 personas no longer overwrite when same name 2025-07-12 10:57:01 -07:00
Raunak Bhagat
89f925662f feat: Add ability to specify vertex-ai model location (#4955)
* Make constant a global

* Add ability to specify vertex location

* Add period

* Add a hardcoding path to the frontend

* Add docs

* Add default value to `CustomConfigKey`

* Consume default value from custom-config-key on frontend

* Use markdown renderer instead

* Update description
2025-07-11 16:16:12 -07:00
Chris Weaver
b64c6d5d40 Skip federated connectors when document sets are specified (#5015) 2025-07-11 15:49:13 -07:00
Raunak Bhagat
36c63950a6 fix: More small IMAP backend fixes (#5014)
* Make recipients an optional header and add IMAP to recognized connectors

* Add sender to external-access; perform dedupe-ing of emails
2025-07-11 20:06:28 +00:00
Raunak Bhagat
3f31340e6f feat: Add support for Confluence Macros (#5001)
* Remove macro stylings from HTML tree

* Add params

* Handle multiple cases of `ac:structured-macro` being found.

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-07-11 00:34:49 +00:00
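The bullet about multiple `ac:structured-macro` occurrences suggests iterating over every match rather than assuming a single one. A hedged BeautifulSoup sketch of that pattern (the keep-the-body/drop policy is illustrative, not the connector's exact rules):

```python
from bs4 import BeautifulSoup

def strip_structured_macros(html: str) -> str:
    """Remove every ac:structured-macro element, keeping its readable body."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all handles zero, one, or many occurrences uniformly
    for macro in soup.find_all("ac:structured-macro"):
        body = macro.find("ac:rich-text-body")
        if body is not None:
            macro.replace_with(*body.contents)  # keep the macro's visible text
        else:
            macro.decompose()  # no readable body: drop the macro entirely
    return str(soup)
```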
Raunak Bhagat
6ac2258c2e Fixes for imap backend (#5011) 2025-07-10 23:58:28 +00:00
Weves
b4d3b43e8a Add more error handling for drive group sync 2025-07-10 18:33:43 -07:00
Rei Meguro
ca281b71e3 add missing slack scope 2025-07-10 17:37:57 -07:00
Wenxi
9bd5a1de7a check file size first and clarify processing logic (#4985)
* check file size first and clarify processing logic

* basic gdrive extraction clarity

* typo

---------

Co-authored-by: Wenxi Onyx <wenxi-onyx@Wenxis-MacBook-Pro.local>
2025-07-10 00:48:36 +00:00
Wenxi Onyx
d3c5a4fba0 add docx fallback 2025-07-09 17:46:15 -07:00
Chris Weaver
f50006ee63 Stop fetching channel info to make pages load faster (#5005) 2025-07-09 16:45:45 -07:00
Evan Lohn
e0092024af add minIO to README 2025-07-09 16:36:18 -07:00
Evan Lohn
675ef524b0 add minio to contributing instructions 2025-07-09 16:33:56 -07:00
Evan Lohn
240367c775 skip id migration based on env var 2025-07-09 13:37:54 -07:00
Chris Weaver
f0ed063860 Allow curators to create public connectors / document sets (#4972)
* Allow curators to create public connectors / document sets

* Address EL comments
2025-07-09 11:38:56 -07:00
Rei Meguro
bcf0ef0c87 feat: original query + better slack expansion 2025-07-09 09:23:03 -07:00
Rei Meguro
0c7a245a46 Revert "feat: original query + better slack expansion"
This reverts commit 583d82433a.
2025-07-09 20:15:15 +09:00
Rei Meguro
583d82433a feat: original query + better slack expansion 2025-07-09 20:11:24 +09:00
Chris Weaver
391e710b6e Slack federated search ux (#4969)
* slack_search.py

* rem

* fix: get elements

* feat: better stack message processing

* fix: mypy

* fix: url parsing

* refactor: federated search

* feat: proper chunking + source filters

* highlighting + source check

* feat: forced section insertion

* feat: multi slack api queries

* slack query expansion

* feat: max slack queries env

* limit slack search to avoid overloading the search

* Initial draft

* more

* simplify

* Improve modal

* Fix oauth flow

* Fully working version

* More niceties

* Improved cascade delete

* document set for fed connector UI

* Fix source filters + improve document set selection

* Improve callback modal

* fix: logging error + showing connectors in admin page user settings

* better log

* Fix mypy

* small rei comment

* Fix pydantic

* Improvements to modals

* feat: distributed pruning

* random fix

* greptile

* Encrypt token

* respect source filters + revert llm pruning ordering

* greptile nit

* feat: thread as context in slack search

* feat: slack doc ordering

* small improvements

* rebase

* Small improvements

* Fix web build

* try fix build

* Move to separate model file

* Use default_factory

* remove unused model

---------

Co-authored-by: Rei Meguro <36625832+Orbital-Web@users.noreply.github.com>
2025-07-08 21:35:51 -07:00
Raunak Bhagat
004e56a91b feat: IMAP connector (#4987)
* Implement fetching; still need to work on document parsing

* Add basic skeleton of parsing email bodies

* Add id field

* Add email body parsing

* Implement checkpointed imap-connector

* Add testing logic for basic iteration

* Add logic to get different header if "to" isn't present

- possible in mailing-list workflows

* Add ability to index specific mailboxes

* Add breaking when indexing has been fully exhausted

* Sanitize all mailbox names + add space between stripped strings after parsing

* Add multi-recipient parsing

* Change around semantic-identifier and title

* Add imap tests

* Add recipients and content assertions to tests

* Add envvars to github actions workflow file

* Remove encoding header

* Update logic to not immediately establish connection upon init of `ImapConnector`

* Add start and end datetime filtering + edit when connection is established / how login is done

* Remove content-type header

* Add note about guards

* Change default parameters to be `None` instead of `[]`

* Address comment on PR

* Implement more PR suggestions

* More PR suggestions

* Implement more PR suggestions

* Change up login/logout flow (PR suggestion)

* Move port number to be envvar

* Make globals variants in enum instead (PR suggestion)

* Fix more documentation related suggestions on PR

* Have the imap connector implement `CheckpointedConnectorWithPermSync` instead

* Add helper for loading all docs with permission syncing
2025-07-08 23:58:22 +00:00
Evan Lohn
103300798f Bugfix/drive doc ids3 (#4998)
* fix migration

* fix migration2

* cursor based pages

* correct vespa URL

* fix visit api index name

* use correct endpoint and query
2025-07-07 18:23:00 +00:00
Evan Lohn
8349d6f0ea Bugfix/drive doc ids (#4990)
* fixed id extraction in drive connector

* WIP migration

* full migration script

* migration works single tenant without duplicates

* tested single tenant with duplicate docs

* migrations and frontend

* tested multitenant

* fix connector tests

* make tests pass
2025-07-06 01:59:12 +00:00
Emerson Gomes
cd63bf6da9 Re-adding .epub file support (#4989)
.epub files apparently were forgotten and were not allowed for upload in the frontend.
2025-07-05 07:48:55 -07:00
Rei Meguro
5f03e85195 fireflies metadata update (#4993)
* fireflies metadata

* str
2025-07-04 18:39:41 -07:00
Raunak Bhagat
cbdbfcab5e fix: Fix bug with incorrect model icon being shown (#4986)
* Fix bug with incorrect model icon being shown

* Update web/src/app/chat/input/LLMPopover.tsx

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update web/src/app/chat/input/LLMPopover.tsx

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update web/src/app/chat/input/LLMPopover.tsx

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update web/src/app/chat/input/LLMPopover.tsx

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Add visibility to filtering

* Update the model names which are shown in the popup

* Fix incorrect llm updating bug

* Fix bug in which the provider name would be used instead

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-07-04 01:39:50 +00:00
SubashMohan
6918611287 remove check for folder assistant before uploading (#4975)
Co-authored-by: Subash <subash@onyx.app>
2025-07-03 09:25:03 -07:00
Chris Weaver
b0639add8f Fix migration (#4982) 2025-07-03 09:18:57 -07:00
Evan Lohn
7af10308d7 drive service account shared fixes (#4977)
* drive service account shared fixes

* oops

* ily greptile

* scrollable index attempt errors

* tentatively correct index errors page, needs testing

* mypy

* black

* better bounds in practice

* remove random failures

* remove console log

* CW
2025-07-02 16:56:32 -07:00
Rei Meguro
5e14f23507 mypy fix 2025-07-01 23:00:33 -07:00
Raunak Bhagat
0bf3a5c609 Add type ignore for dynamic sqlalchemy class (#4979) 2025-07-01 18:14:35 -07:00
Emerson Gomes
82724826ce Remove hardcoded image extraction flag for PDFs
PDFs currently always have their images extracted.
This will make use of the "Enable Image Extraction and Analysis" workspace configuration instead.
2025-07-01 13:57:36 -07:00
Wenxi
f9e061926a account for category prefix added by user (#4976)
Co-authored-by: Wenxi Onyx <wenxi-onyx@Wenxis-MacBook-Pro.local>
2025-07-01 10:39:46 -07:00
Chris Weaver
8afd07ff7a Small gdrive perm sync enhancement (#4973)
* Small gdrive perm sync enhancement

* Small enhancement
2025-07-01 09:33:45 -07:00
Evan Lohn
6523a38255 search speedup (#4971) 2025-07-01 01:41:27 +00:00
Yuhong Sun
264878a1c9 Onyx Metadata Header for File Connector (#4968) 2025-06-29 16:09:06 -07:00
Weves
e480946f8a Reduce frequency of heavy checks on primary for cloud 2025-06-28 17:56:34 -07:00
Evan Lohn
be25b1efbd perm sync validation framework (#4958)
* perm sync validation framework

* frontend fixes

* validate perm sync when getting runner

* attempt to fix integration tests

* added new file

* oops

* skipping salesforce test due to creds

* add todo
2025-06-28 19:57:54 +00:00
Chris Weaver
204493439b Move onyx_list_tenants.py to make sure it's in the image (#4966)
* Move onyx_list_tenants.py to make sure it's in the image

* Improve
2025-06-28 13:18:14 -07:00
Weves
106c685afb Remove CONCURRENTLY from migrations 2025-06-28 11:59:59 -07:00
Raunak Bhagat
809122fec3 fix: Fix bug in which emails would be fetched during initial indexing (#4959)
* Add new convenience method

* Fix bug in which emails would be fetched for initial indexing

* Improve tests for MS Teams connector

* Fix test_gdrive_perm_sync_with_real_data patching

* Protect against incorrect truthiness

---------

Co-authored-by: Weves <chrisweaver101@gmail.com>
2025-06-27 22:05:50 -07:00
Chris Weaver
c8741d8e9c Improve mt migration process (#4960)
* Improve MT migration process

* improve MT migrations

* Improve parallel migration

* Add additional options to env.py

* Improve script

* Remove script

* Simplify

* Address greptile comment

* Fix st migration

* fix run_alembic_migrations
2025-06-27 17:31:22 -07:00
Weves
885f01e6a7 Fix test_gdrive_perm_sync_with_real_data patching 2025-06-27 16:34:37 -07:00
Rei Meguro
3180a13cf1 source fix (#4956) 2025-06-27 13:20:42 -07:00
Rei Meguro
630ac31355 KG vespa error handling + separating relationship transfer & vespa updates (#4954)
* feat: move vespa at end in try block

* simplify query

* mypy

* added order by just in case for consistent pagination

* liveness probe

* kg_p check for both extraction and clustering

* fix: better vespa logging
2025-06-26 22:05:57 -07:00
Chris Weaver
80de62f47d Improve drive group sync (#4952)
* Improve drive group sync

* Improve group syncing approach

* Fix github action

* Improve tests

* address greptile
2025-06-26 20:14:35 -07:00
Raunak Bhagat
c75d42aa99 perf: Improve performance of MS Teams permission-syncing logic (#4953)
* Add function stubs for Teams

* Implement more boilerplate code

* Change structure of helper functions

* Implement teams perms for the initial index

* Make private functions start with underscore

* Implement slim_doc retrieval and fix up doc_sync

* Simplify how doc-sync is done

* Refactor jira doc-sync

* Make locally used function start with an underscore

* Update backend/ee/onyx/configs/app_configs.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Add docstring to helper function

* Update tests

* Add an expected failure

* Address comment on PR

* Skip expert-info if user does not have a display-name

* Add doc comments

* Fix error in generic_doc_sync

* Move callback invocation to earlier in the loop

* Update tests to include proper list of user emails

* Update logic to grab user emails as well

* Only fetch expert-info if channel is not public

* Pull expert-info creation outside of loop

* Remove unnecessary call to `iter`

* Switch from `dataclass` to `BaseModel`

* Simplify boolean logic

* Simplify logic for determining if channel is public

* Remove unnecessary channel membership-type

* Add log-warns

* Only perform another API fetch if email is not present

* Address comments on PR

* Add message on assertion failure

* Address typo

* Make exception message more descriptive

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-06-27 01:41:01 +00:00
Raunak Bhagat
e1766bca55 feat: MS Teams permission syncing (#4934)
* Add function stubs for Teams

* Implement more boilerplate code

* Change structure of helper functions

* Implement teams perms for the initial index

* Make private functions start with underscore

* Implement slim_doc retrieval and fix up doc_sync

* Simplify how doc-sync is done

* Refactor jira doc-sync

* Make locally used function start with an underscore

* Update backend/ee/onyx/configs/app_configs.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Add docstring to helper function

* Update tests

* Add an expected failure

* Address comment on PR

* Skip expert-info if user does not have a display-name

* Add doc comments

* Fix error in generic_doc_sync

* Move callback invocation to earlier in the loop

* Update tests to include proper list of user emails

* Update logic to grab user emails as well

* Only fetch expert-info if channel is not public

* Pull expert-info creation outside of loop

* Remove unnecessary call to `iter`

* Switch from `dataclass` to `BaseModel`

* Simplify boolean logic

* Simplify logic for determining if channel is public

* Remove unnecessary channel membership-type

* Add log-warns

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-06-26 22:36:09 +00:00
Rei Meguro
211102f5f0 kg cleanup + reintroducing deep extraction & classification (#4949)
* kg cleanup

* more cleanup

* fix: copy over _get_classification_content_from_call_chunks for content formatting

* added back deep extraction logic

* feat: making deep extraction and clustering work

* nit
2025-06-26 14:46:50 -07:00
Weves
c46cc4666f Fix query history 2 2025-06-25 21:35:53 -07:00
joachim-danswer
0b2536b82b expand definition of public 2025-06-25 20:01:09 -07:00
Rei Meguro
600a86f11d Add creator to linear (#4948)
* add creator to linear

* fix: mypy
2025-06-25 18:19:36 -07:00
Rei Meguro
4d97a03935 KG Attribute Overhaul + Processing Tests (#4933)
* feat: extract email

* title

* feat: new type definition

* working

* test and bugfix

* fix: set docid

* fix: mypy

* feat: show implied entities too

* fix import + migration

* fix: added random delay for vespa

* fix: mypy

* mypy again...

* fix: nit

* fix: mypy

* SOLUTION!

* fix

* cleanup

* fix: transfer

* nit

---------

Co-authored-by: joachim-danswer <joachim@danswer.ai>
2025-06-25 05:06:12 +00:00
Raunak Bhagat
5d7169f244 Implement JIRA permission syncing (#4899) 2025-06-24 23:59:26 +00:00
Wenxi
df9329009c curator bug fixes (#4941)
* curator bug fixes

* basic users default to my files

* fix admin param + move delete button

* fix trashcan admin only

---------

Co-authored-by: Wenxi Onyx <wenxi-onyx@Wenxis-MacBook-Pro.local>
2025-06-24 21:33:47 +00:00
Arun Philip
e74a0398dc Update Docker Compose restart policy to unless-stopped
Changed the restart policy to unless-stopped to ensure containers
automatically restart after failures or reboots but allow manual stop
without immediate restart.

This is preferable over always because it prevents containers from
restarting automatically after a manual stop, enabling controlled
shutdowns and maintenance without unintended restarts.
2025-06-24 13:27:50 -07:00
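For reference, the change is one word per service in the compose files; a representative fragment (the service name is illustrative):

```yaml
services:
  api_server:
    # unless-stopped restarts after failures and reboots but respects a
    # manual `docker compose stop`; `always` would restart even after that
    restart: unless-stopped
```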
SubashMohan
94c5822cb7 Add MinIO configuration to env template and update restart script for MinIO container (#4944)
Co-authored-by: Subash <subash@onyx.app>
2025-06-24 17:21:16 +00:00
joachim-danswer
dedac55098 KG extraction without vespa queries (#4940)
* no vespa in extraction

* prompt/flow improvements

* EL comments

* nit

* Updated get_session_with_current_tenant import

---------

Co-authored-by: Rei Meguro <36625832+Orbital-Web@users.noreply.github.com>
2025-06-24 15:02:50 +00:00
Chris Weaver
2bbab5cefe Handle very long file names (#4939)
* Handle very long file names

* Add logging

* Enhancements

* EL comments
2025-06-23 19:22:02 -07:00
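Most filesystems and object stores cap name length (commonly 255 bytes), so "handling" very long names usually means truncating while keeping the extension and uniqueness. A generic sketch of that idea (the 255-byte cap is an assumption, and the hashing scheme is illustrative, not Onyx's actual code):

```python
import hashlib
import os

MAX_NAME_BYTES = 255  # common filesystem limit; an assumption here

def shorten_file_name(name: str) -> str:
    """Truncate an over-long file name, keeping its extension and a short
    hash of the original so distinct names stay distinct."""
    if len(name.encode("utf-8")) <= MAX_NAME_BYTES:
        return name
    stem, ext = os.path.splitext(name)
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()[:8]
    budget = MAX_NAME_BYTES - len(ext.encode("utf-8")) - len(digest) - 1
    stem_bytes = stem.encode("utf-8")[:budget]
    # decode defensively in case the cut landed mid-codepoint
    short_stem = stem_bytes.decode("utf-8", errors="ignore")
    return f"{short_stem}-{digest}{ext}"
```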
joachim-danswer
4bef718fad fix kg db proxy (#4942) 2025-06-23 18:27:59 -07:00
Chris Weaver
e7376e9dc2 Add support for db proxy (#4932)
* Split up engine file

* Switch to schema_translate_map

* Fix mass search/replace

* Remove unused

* Fix mypy

* Fix

* Add back __init__.py

* kg fix for new session management

Adding "<tenant_id>" in front of all views.

* additional kg fix

* better handling

* improve naming

---------

Co-authored-by: joachim-danswer <joachim@danswer.ai>
2025-06-23 17:19:07 -07:00
Raunak Bhagat
8d5136fe8b Fix error in which curator sidebars were hitting kg-exposed endpoint 2025-06-23 17:07:11 -07:00
joachim-danswer
3272050975 docker dev and prod template (#4936)
* docker dev and prod template

* more dev files
2025-06-23 21:43:42 +00:00
Weves
1960714042 Fix query history 2025-06-23 14:32:14 -07:00
Weves
5bddb2632e Fix parallel tool calls 2025-06-23 09:50:44 -07:00
Raunak Bhagat
5cd055dab8 Add minor type-checking fixes (#4916) 2025-06-23 13:34:40 +00:00
Raunak Bhagat
fa32b7f21e Update ruff and remove ruff-formating from pr checks (#4914) 2025-06-23 05:34:34 -07:00
Rei Meguro
37f7227000 fix: too many vespa request fix (#4931) 2025-06-22 14:31:42 -07:00
Chris Weaver
c1f9a9d122 Hubspot connector enhancements (#4927)
* Enhance hubspot connector

* Add companies, deals, and tickets

* improve typing

* Add HUBSPOT_ACCESS_TOKEN to connector tests

* Fix prettier

* Fix mypy

* Address JR comments
2025-06-22 13:54:04 -07:00
Rei Meguro
045b7cc7e2 feat: comma separated citations (#4923)
* feat: comam separated citations

* nit

* fix

* fix: comment
2025-06-21 22:51:32 +00:00
joachim-danswer
970e07a93b Forcing vespa language 2025-06-21 16:12:13 -07:00
joachim-danswer
d463a3f213 KG Updates (#4925)
* updates

 - no classification if deep extraction is False
 - separate names for views in LLM generation
 - better prompts
 - any relationship type provided to LLM that relates to identified entities

* CW feedback/comment update
2025-06-21 20:16:39 +00:00
Wenxi
4ba44c5e48 Fix no subject gmail docs (#4922)
Co-authored-by: Wenxi Onyx <wenxi-onyx@Wenxis-MacBook-Pro.local>
2025-06-20 23:22:49 +00:00
Chris Weaver
6f8176092e S3 like file store (#4897)
* Move to an S3-like file store

* Add non-mocked test

* Add S3 tests

* Improve migration / add auto-running tests

* Refactor

* Fix mypy

* Small fixes

* Improve migration to handle downgrades

* fix file store tests

* Fix file store tests again

* Fix file store tests again

* Fix mypy

* Fix default values

* Add MinIO to other compose files

* Working helm w/ minio

* Fix test

* Address greptile comments

* Harden migration

* Fix README

* Fix it

* Address more greptile comments

* Fix it

* Rebase

* Handle multi-tenant case

* Fix mypy

* Fix test

* fix test

* Improve migration

* Fix test
2025-06-20 14:22:05 -07:00
Wenxi
198ec417ba fix gemini model names + add vertex claude sonnet 4 (#4920)
* fix gemini model names + add vertex claude sonnet 4

* few more models

---------

Co-authored-by: Wenxi Onyx <wenxi-onyx@Wenxis-MacBook-Pro.local>
2025-06-20 18:18:36 +00:00
Wenxi
fbdf7798cf GCS metadata processing (#4879)
* GCS metadata processing

* Unprocessable files should still be indexed to be searched by title

* Moved re-used logic to utils. Combined file metadata PR with GCS metadata changes

* Added OnyxMetadata type, adjusted timestamp naming consistency, clarified timestamp logic

* Use BaseModel

---------

Co-authored-by: Wenxi Onyx <wenxi-onyx@Wenxis-MacBook-Pro.local>
2025-06-20 16:11:38 +00:00
Weves
7bd9c856aa Really add psql to api-server 2025-06-19 18:50:17 -07:00
Rei Meguro
948c719d73 fix (#4915) 2025-06-19 23:06:34 +00:00
Weves
42572479cb Don't load prompts if not necessary 2025-06-19 16:56:33 -07:00
Suvodhoy Sinha
accd363d3f fix(discourse-connector): handle redirect issue with categoryId rewriting page number (#4780) 2025-06-19 16:21:09 -07:00
Rei Meguro
8cf754a8b6 Kg config refactor (#4902)
* refactor: kg_config

* feat: reworked migrations

* nit

* fix: test

* rebase alembic migration

* feat: bypass cache

* fix: mypy

* fix: processing when kg disabled

* feat: celery rework

* fix: grammar

* fix: only do kg commands for KG Beta

* fix: keep config on downgrade

* fix: nit
2025-06-19 20:25:56 +00:00
Raunak Bhagat
bf79220ac0 build: Remove ruff (#4912)
* Update ruff version

* Update format command

* Update pyproject.toml

* Remove line-length

* Remove ruff in general
2025-06-18 19:12:59 -07:00
Weves
4c9dc14e65 Add slackbot to helm 2025-06-18 11:04:25 -07:00
Weves
f8621f7ea9 Add psql to backend containers 2025-06-17 21:15:40 -07:00
trial-danswer
e0e08427b9 Feature/connector creation feedback (#4644)
* Loading on connector creation

* Dangling connectors cleaned up. Fixed loading modal.

* Dangling connector deletion happens immediately at timeout. Swapped loading modal to spinner for consistency

* Removed redundant delete func
2025-06-17 18:32:37 -07:00
Evan Lohn
169df994da tiny connector logging tweaks (#4908) 2025-06-17 22:13:40 +00:00
Evan Lohn
d83eaf2efb fail loudly when error should be propagated (#4903) 2025-06-17 22:12:19 +00:00
joachim-danswer
4e1e30f751 KG - Entity-Only Path (#4898)
* Create Entity-Only path for simple entity-focussed queries. Plus
other fixes.

* fix: use env var

* mypy fix

* fix: mypy

---------

Co-authored-by: Rei Meguro <36625832+Orbital-Web@users.noreply.github.com>
2025-06-17 22:10:29 +00:00
Raunak Bhagat
561f8c9c53 fix: Implement time-filtering for MS Teams document fetching (#4906)
* Add delta-time filtering

* Remove unused variables

* Update retry logic

* Remove variable change (inside of overriden function)

* Add back helpful variables

* Add missing assignment to variable

* Reorder classes in order to avoid using quotes

* Compress f-strings

* Address PR comment

* Implement pagination
2025-06-17 21:06:04 +00:00
SubashMohan
f625a4d0a7 feat: Add support for Assume Role authentication in S3 (#4907)
Co-authored-by: Subash <subash@onyx.app>
2025-06-17 20:56:44 +00:00
rkuo-danswer
746d4b6d3c Bugfix/salesforce correctness 3 (#4598)
* refactor salesforce sqlite db access

* more refactoring

* refactor again

* refactor again

* rename object

* add finalizer to ensure db connection is always closed

* avoid unnecessarily nesting connections and commit regularly when possible

* remove db usage from csv download

* dead code

* hide deprecation warning in ddtrace

* remove unused param

* local testing WIP

* stuff for pytest-dotenv

* autodetect filter types instead of assuming last modified always works (it doesn't)
Move filtering responsibility up instead of making utility calls excessively stateful

* fix how changed parent id's are yielded

* remove slow part of test

* clean up comments

* small refactor

* more refactor

* add normalize test

* checkpoint and comments

* add helper function

* fix gitignore

* add gitignore

* update pyproject

* delta updates

* remove comments

* fix time import

* fix set init

* add salesforce env vars

* cleanup

* more cleanup

* filtered item is unbound here

* typo

* fix suffix check

* fix empty type query

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-06-17 18:39:22 +00:00
Evan Lohn
fdd48c6588 fix db connection assertion (#4905) 2025-06-17 03:42:55 +00:00
Rei Meguro
23a04f7b9c Kg Subtype Rework (#4892)
* feat: vespa schema update

* fix: vespa multiple entities/relations yql logic

* fix: mypy

* fix: comments

* fix: kgchunkformat

* fix: reset vespa fix

* feat: vespa schema update

* feat: modify entity type and attribute value extraction

* feat: modify entity type and attribute value extraction

* feat: removed entity class and subtype from db

* slightly formatting

* feat: subtype narrowed normalization

* fix: mypy

* nits

* fix: rebase error fix

* fix: null handling

* rename for clarity

* fix: reverse order downgrade

* fix: nit

* rebase leftovers
2025-06-17 01:17:35 +00:00
Chris Weaver
b7b0dde7aa Remove non-helm kubernetes deployment option (#4904)
* Remove non-helm kubernetes deployment option

* Improve Vespa default set up

* Make nginx LoadBalancer

* Add version in values.yaml

* Fix lint

* Fix typo
2025-06-16 18:27:23 -07:00
Wenxi
c40b78c7e9 Bugfix/honor disable default slack config (#4891)
* Honor disable default config & improve UI clarity

* Disable default will also disable DMs

---------

Co-authored-by: Wenxi Onyx <wenxi-onyx@Wenxis-MacBook-Pro.local>
2025-06-16 22:24:51 +00:00
Raunak Bhagat
33c0133cc7 feat: Knowledge graph full-stack implementation (#4790)
* db setup

* transfer 1 - incomplete

* more adjustments

* relationship table + query update

* temp view creation

* restructuring

* nits

* updates

* separate read_only engine

* extraction revamp

* focus on metadata relationships 1

* dev

* migration downgrade fix

* rebase migration change

* a3+

* progress

* base

* new extraction

* progress

* fixed KG extraction

* nits

* updates

* simplifications & cleanup

* fixes

* updates

* more feature flag checks

* fixes

* extraction process fix

* read-only user creation as part of setup

* fix for missing entity attributes

* kg read-only user creation as part of migration

* typo

* EL initial comments

* initial Account/SF Connector changes

* SF Connector update

 - include account information

* base w/ salesforce

* evan updates + quite a bit more

* kg-filtered search

* EL changes pt 2

* migrations and env vars

* quick migration fix

* migration update

* post_rebase fixes

* mypy fixes

* test fixes

* test fix

* test fix

* read_only pool + misc

* nf

* env vars

* test improvements

* salesforce fix

* test update

* small changes

* small adjustments

* SF Connector fix & kg_stage removal for one table

* mypy fix

* small fixes

* EL + RK (pt 1) comments

* nit

* setting updated

* Salesforce test update

* EL comments

* read-only user replacement & cleanup

* SQL View fix

* converting entity type-name separators

* sql view group ownership

* view fix

* SQL tweak

* dealing with docs that were skipped by indexing

* increased error handling

* more error handling

* Output formatting fix

* kg-incremental-reindexing

* 0-doc found improvement

* celery

* migration correction

* timeout adjustments

* nit

* Updated migration

* Entity Normalization for KG Dev 1 (#4746)

* feat: trigrams column

* fix: reranking and db

* feat: v1

* fix: convert to orm

* feat: parallel

* fix: default to id_name

* fix: renamed semantic_id and semantic_id_trigrams

* fix: scalar subquery

* fix: tuning + redundancy

* fix: threshold

* fix: typo

* fix: shorten names

* wip

* fix: reverted

* feat: config

* feat: works but it was dumb

* feat: clustering works

* fix: mypy

* normalization <-> language awareness for SQL generation

* small type fixes

---------

Co-authored-by: joachim-danswer <joachim@danswer.ai>

* mypy

* typo and dead code

* kg_time_fencing

* feat: remove temp views on migration downgrade

* remove functions and triggers for now

* rebase adjustments

* EL code review results

* quick fix + trigger/funcs for single tenant

* fix: typo, mypy, dead code

* fix: autoflake

* small updates

* nit

* fix: typo

* early + faster view creation

* Extension creation in MT migration

* nit changes to default ETs

* Incremental Clustering and KG Refactor V1 (#4784)

Optimized/restructured incremental clustering. New pipeline that moves vespa updates to clustering.
Also, celery configuration has been updated.
---------

Co-authored-by: joachim-danswer <joachim@danswer.ai>

* Move file

* Fix all prior imports

* Clean sidebar items logic; add kg page

* Add kg_processing celery background task

* prompt tweak & ET extraction reset

* more general hierarchical structure

* feat: better vespa reset logic

* Add basic knowledge graph configuration

* Add configurations for KG entity-type

* prompt optimization and entity replacements

* small prompt changes

* Implement backend APIs

* KG Refactor V2 (#4814)

Clustering & Extraction improvements & various nits 

Co-authored-by: joachim-danswer <joachim@danswer.ai>

* add connector-level coverage days

* Update APIs to be more frontend ergonomic

* Add simple test

* Make config optional in test

* fix: nit

* initial EL responses

* refactor: helper functions for formatting

* fix: more helper fns & comments

* fix: comment code that's been implemented elsewhere

* Add entity-types APIs

* Hook up frontend to backend

* Finish hookup up entity-types to backend

* Update ordering of entity-types and fix form submitting

* Add backend API to get kg-exposed

* Add kg-exposed to sidebar

* Fix path

* Use existing values, even if kg-enabled is false

* Update what initial values are used

* Add skeleton for kg resetting

* Add return type

* Add default entity-type population when fetching entity-types

* Remove circular deps

* Minor fixes to logic

* Edit logic for default entity-types population

* Add re-index API + skeleton

* Update verbiage for KG

* Remove templatization in favour of function

* Address comments on PR

* Pull call out into its own binding

* Remove re-index API and revert implement of reset back to stub

* Fix circular import error

* Remove 'reindex' button

* Edit how the empty vendor name list is handled

* Edit how exposed is processed

* Redirect if navigated to `/admin/kg` and kg is not exposed

* Address comments on PR

* reset + entity type table display & updating updates

* Update fetching entity-types

* Make KG entity types refresh when reset

* Edit verbiage of reset button

* Update package-lock.json file

* Protect against overflowing

* Re-implement refreshing table after reset

* Edit message when nothing is shown.

* UI enhancements

* small fixes

* remove form validation?

* fix

* nit

* nit

* nit

* nit

* fix configure max coverage days

* EL comments for JR

* refactor: moved functions where they belong to fix circular import

* feat: intuitive coverage days

* feat: intuitive coverage days

* fix: safe date picker

* fix: startdate

* evan fixes

* fix: evan comment on enable/disable

* fix: style

* fix: ui issues

* fix: ui issues for reset too

* fix: tests

* fix: kg entity is not enabled

* fix: entity type reload on enable

---------

Co-authored-by: joachim-danswer <joachim@danswer.ai>
Co-authored-by: Rei Meguro <36625832+Orbital-Web@users.noreply.github.com>
2025-06-16 15:51:11 +00:00
Shahar Mazor
cca5bc13dc Make password validation configurable (#4869) 2025-06-14 14:31:05 -07:00
Wenxi
d5ecaea8e7 new script for hard deleting sessions (#4883)
Co-authored-by: Wenxi Onyx <wenxi-onyx@Wenxis-MacBook-Pro.local>
2025-06-14 01:06:47 -07:00
joachim-danswer
b6d3c38ca9 Prep KG on-demand indexing through celery (#4874)
* filter updates

* nits

* nit

* moving to celery

* RK discussion updates

* fix of postgres reset logic

* greptile comments

* RK comments

* fix

* change num_chunks

* further hardening

* nit

* added logging

* fix: mypy and argument to function

* feat: log so we know when rs finishes

* nits

* nit

---------

Co-authored-by: Rei Meguro <36625832+Orbital-Web@users.noreply.github.com>
2025-06-13 18:00:02 +00:00
Rei Meguro
b5fc1b4323 fix kg (#4881) 2025-06-12 15:44:14 -07:00
rkuo-danswer
a1a9c42b0b bump disk size (#4882)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-06-12 20:19:49 +00:00
Raunak Bhagat
e689e143e5 cleanup: Edit logic for default entity-types population (#4876)
* Edit logic for default entity-types population

* Remove templatization in favour of function

* Address comments on PR

* Pull call out into its own binding

* Address comments on PR
2025-06-12 08:46:32 -07:00
joachim-danswer
a7a168d934 Dual search pipeline for non-tool-calling LLMs (#4872)
Added the dual pipeline for non-tool-calling LLMs as well.
A helper function was created.
2025-06-11 17:43:44 -07:00
joachim-danswer
69f47fc3e3 kg_update (#4858)
Improving Vespa chunk retrieval

Co-authored-by: Weves <chrisweaver101@gmail.com>
2025-06-11 17:40:20 -07:00
Evan Lohn
8a87140b4f JR comments 2025-06-11 15:51:11 -07:00
Evan Lohn
53db0ddc4d skip large empty excel files 2025-06-11 15:51:11 -07:00
Rei Meguro
087085403f fix: kg answer streaming (#4877)
* fix: answer streaming

* fix: env vars

* fix: remove duplicate
2025-06-11 15:47:32 -07:00
Chris Weaver
c040b1cb47 Switch to chonkie from llamaindex chunker (#4838)
* Switch to chonkie from llamaindex chunker

* Remove un-intended changes

* Order requirements

* Upgrade chonkie version
2025-06-11 14:12:52 -07:00
Raunak Bhagat
1f4d0716b9 Remove invocation of parallel_yield (was causing problems) 2025-06-11 12:19:05 -07:00
SubashMohan
aa4993873f feat: add configurable image model name and update dependencies (#4873)
Co-authored-by: Subash <subash@onyx.app>
2025-06-11 16:17:11 +00:00
Raunak Bhagat
ce031c4394 Change how replies are processed (#4870) 2025-06-11 02:14:34 +00:00
Rei Meguro
d4dadb0dda feat: default kg entity types (#4851)
* feat: default entity types

* refactor: cleaned duplicate code

* fix: delete unused fn

* refactor: dictionary instead of model for better type safety

* fix: mypy
2025-06-10 18:19:41 -07:00
Raunak Bhagat
0ded5813cd fix: Add rate-limiting to Teams API request (#4854)
* Add rate-limiting to Teams API request

* Add comment for rate-limiting

* Implement rate-limiting for office365 library.

* Remove hardcoded value

* Fix nits on PR
2025-06-10 21:12:06 +00:00
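Rate-limiting against a Graph-style API typically means honoring 429 responses and their Retry-After header. A hedged sketch of that loop using `requests` (generic, not the office365 library's internals):

```python
import time
import requests

def get_with_rate_limit(url: str, headers: dict, max_retries: int = 5) -> requests.Response:
    """GET a URL, sleeping and retrying whenever the server answers 429."""
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code != 429:
            return response
        retry_after = response.headers.get("Retry-After")
        # honor a numeric Retry-After hint, else fall back to exponential backoff
        delay = float(retry_after) if retry_after and retry_after.isdigit() else 2.0 ** attempt
        time.sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_retries} retries: {url}")
```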
Evan Lohn
83137a19fb remove lru cache (#4865)
* remove lru cache

* fix types issue

* mypy
2025-06-10 19:40:07 +00:00
Zhipeng He
be66f8dbeb add: add Qwen icon in LLM list and update provider icon mapping (#4625) 2025-06-10 12:36:18 -07:00
SubashMohan
70baecb402 Enhancement/gpt4o image gen support (#4859)
* initial model switching changes

* Update image generation output format and revise prompt handling

* Add validation for output format in ImageGenerationTool and implement tests

---------

Co-authored-by: Subash <subash@onyx.app>
2025-06-10 08:52:21 -07:00
Chris Weaver
c27ba6bad4 Add perm sync to indexing for google drive (#4842)
* Add perm sync to indexing for google drive

* Applying changes elsewhere

* Turn on EE for perm sync slack tests

* Add new load_from_checkpoint_with_perm_sync

* Adjust way perm sync configs are represented

* Adjust run_indexing to handle perm sync on first run

* Add missing file

* Add sync on index for slack

* Add test + fixes

* Update permission

* Fix connector tests

* skip perm sync test if running MIT tests

* Address EL comments
2025-06-10 02:36:04 +00:00
Evan Lohn
61fda6ec58 drive smaller checkpoints v1 (#4849)
* drive smaller checkpoints v1

* v2

* text encoding fix
2025-06-09 23:35:12 +00:00
Evan Lohn
2c93841eaa errors have correct file id (#4818) 2025-06-09 22:36:35 +00:00
Weves
879db3391b Enable embedding parallelism 2025-06-09 15:36:36 -07:00
Suvodhoy Sinha
6dff3d41fa fix(discourse): Remove early break that was limiting topics to batch size 2025-06-09 15:20:43 -07:00
Raunak Bhagat
e399eeb014 fix: Query History Export (#4841)
* Move task registration to earlier in the API

* Remove unnecessary check
2025-06-09 10:10:53 -07:00
joachim-danswer
b133e8fcf0 enforce sub_question ordering for chat_message (#4848) 2025-06-09 02:37:35 +00:00
SubashMohan
ba9b24a477 Enhance credential management with multi-auth support and improved validation (#4846)
Co-authored-by: Subash <subash@onyx.app>
2025-06-09 00:53:18 +00:00
Rei Meguro
cc7fb625a6 KG autofill metadata (#4834)
* feat: autofill metadata

* fix: typo

* fix: enum

* fix: nit
2025-06-08 21:56:50 +00:00
Rei Meguro
2b812b7d7d Kg batch clustering (#4847)
* super genius kg_entity parent migration

* feat: batched clustering

* fix: nit
2025-06-08 21:16:10 +00:00
joachim-danswer
c5adbe4180 Knowledge Graph v1 (#4626)
* db setup

* transfer 1 - incomplete

* more adjustments

* relationship table + query update

* temp view creation

* restructuring

* nits

* updates

* separate read_only engine

* extraction revamp

* focus on metadata relationships 1

* dev

* migration downgrade fix

* rebase migration change

* a3+

* progress

* base

* new extraction

* progress

* fixed KG extraction

* nits

* updates

* simplifications & cleanup

* fixes

* updates

* more feature flag checks

* fixes

* extraction process fix

* read-only user creation as part of setup

* fix for missing entity attributes

* kg read-only user creation as part of migration

* typo

* EL initial comments

* initial Account/SF Connector changes

* SF Connector update

 - include account information

* base w/ salesforce

* evan updates + quite a bit more

* kg-filtered search

* EL changes pt 2

* migrations and env vars

* quick migration fix

* migration update

* post_rebase fixes

* mypy fixes

* test fixes

* test fix

* test fix

* read_only pool + misc

* nf

* env vars

* test improvements

* salesforce fix

* test update

* small changes

* small adjustments

* SF Connector fix & kg_stage removal for one table

* mypy fix

* small fixes

* EL + RK (pt 1) comments

* nit

* setting updated

* Salesforce test update

* EL comments

* read-only user replacement & cleanup

* SQL View fix

* converting entity type-name separators

* sql view group ownership

* view fix

* SQL tweak

* dealing with docs that were skipped by indexing

* increased error handling

* more error handling

* Output formatting fix

* kg-incremental-reindexing

* 0-doc found improvement

* celery

* migration correction

* timeout adjustments

* nit

* Updated migration

* Entity Normalization for KG Dev 1 (#4746)

* feat: trigrams column

* fix: reranking and db

* feat: v1

* fix: convert to orm

* feat: parallel

* fix: default to id_name

* fix: renamed semantic_id and semantic_id_trigrams

* fix: scalar subquery

* fix: tuning + redundancy

* fix: threshold

* fix: typo

* fix: shorten names

* wip

* fix: reverted

* feat: config

* feat: works but it was dumb

* feat: clustering works

* fix: mypy

* normalization <-> language awareness for SQL generation

* small type fixes

---------

Co-authored-by: joachim-danswer <joachim@danswer.ai>

* mypy

* typo and dead code

* kg_time_fencing

* feat: remove temp views on migration downgrade

* remove functions and triggers for now

* rebase adjustments

* EL code review results

* quick fix + trigger/funcs for single tenant

* fix: typo, mypy, dead code

* fix: autoflake

* small updates

* nit

* fix: typo

* early + faster view creation

* Extension creation in MT migration

* nit changes to default ETs

* Incremental Clustering and KG Refactor V1 (#4784)

Optimized/restructured incremental clustering. New pipeline actually that moves vespa updates to clustering.
Also, celery configuration has been updated.
---------

Co-authored-by: joachim-danswer <joachim@danswer.ai>

* prompt tweak & ET extraction reset

* more general hierarchical structure

* feat: better vespa reset logic

* prompt optimization and entity replacements

* small prompt changes

* KG Refactor V2 (#4814)

Clustering & Extraction improvements & various nits 

Co-authored-by: joachim-danswer <joachim@danswer.ai>

* add connector-level coverage days

* fix: nit

* initial  EL responses

* refactor: helper functions for formatting

* fix: more helper fns & comments

* fix: comment code that's been implemented elsewhere

* fix: tenant_id missing arg

* fix: removed debugging stuff

* fix: moved kg_interactions db query to helper fn

* fix: tenant_id

* fix: tenant_id & removed outdated helper fn

* fix always set entity class

* fix: typo

* fix alembic heads

* fix: celery logging

* fix: migrations fix

* fix: multi tenant permissions

* fix: temp connector fix

* fix: downgrade

* Fix upgrade migration

* fix: tenant for normalization

* added additional acl

* stray EL comments

* fix: connector test

* fix mypy

* fix: temporary connector test fix

* fix: jira connector test

* nit

* small nits

* fix: black

* fix: mypy

* fix: mypy

---------

Co-authored-by: Rei Meguro <36625832+Orbital-Web@users.noreply.github.com>
2025-06-07 23:14:20 +00:00
Wenxi
21dc3a2456 Restart script clarity (#4839)
* Add error clarity to restart containers script

* erroneous cleanup on exit

* fix when starting containers for the first time

---------

Co-authored-by: Wenxi Onyx <wenxi-onyx@Wenxis-MacBook-Pro.local>
2025-06-06 09:11:58 -07:00
Wenxi
9631f373f0 Restart script clarity (#4837)
* Add error clarity to restart containers script

* erroneous cleanup on exit

* space

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

---------

Co-authored-by: Wenxi Onyx <wenxi-onyx@Wenxis-MacBook-Pro.local>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-06-05 19:24:06 -07:00
rexjohannes
cbd4d46fa5 remove Hagen from CONTRIBUTING.md (#4778)
* remove Hagen from CONTRIBUTING.md

* fix slack invite url

* fix second slack invite
2025-06-05 18:59:34 -07:00
Wenxi
dc4b9bc003 Fixed indexing when no sites are specified (#4822)
* Fixed indexing when no sites are specified

* Added test for Sharepoint all sites index

* Accounted for paginated results.

* Typing

* Typing

---------

Co-authored-by: Wenxi Onyx <wenxi-onyx@Wenxis-MacBook-Pro.local>
2025-06-05 23:25:20 +00:00
Chris Weaver
affb9e6941 Extend the onyx_vespa_schemas.py script (#4835) 2025-06-05 22:47:32 +00:00
Chris Weaver
dc542fd7fa Enable default quantization (#4815)
* Adjust migration

* update default in form

* Add cloud indices for bfloat16

* Update backend/shared_configs/configs.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update vespa schema gen script

* Move embedding configs

* Remove unused imports

* remove import from shared configs

* Remove unused model

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-06-05 14:02:08 -07:00
rkuo-danswer
85eeb21b77 add slack percentage progress (#4809)
* add percentage progress

* range checking

* formatting

* for new channels, skip them if the most recent messages are all from bots

* comments

* bypass bot channels

* code review

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-06-05 17:52:46 +00:00
Rei Meguro
4bb3ee03a0 Update GitHub Connector metadata (#4769)
* feat: updated github metadata

* feat: nullity check

* feat: more metadata

* feat: userinfo

* test: connector test + more metadata

* feat: num files changed

* feat str

* feat: list of str
2025-06-04 18:33:14 +00:00
Maciej Bryński
1bb23d6837 Upgrade asyncpg for Python 3.12 (#4699) 2025-06-04 11:44:52 -07:00
joachim-danswer
f447359815 bump up agent timeouts across the board (#4821) 2025-06-04 14:36:46 +00:00
Weves
851e0b05f2 Small tweak to user invite flow 2025-06-04 08:09:33 -07:00
Chris Weaver
094cc940a4 Small embedding model cleanups (#4820)
* Small embedding model cleanups

* fix

* address greptile

* fix build
2025-06-04 00:10:44 +00:00
rkuo-danswer
51be9000bb Feature/vespa bump (#4819)
* bump cloudformation

* update kubernetes

* bump helm chart

* bump docker compose

* update chart.lock

* ai accident!

* bump vespa helm chart for fix

* increase timeout

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-06-04 00:03:01 +00:00
joachim-danswer
80ecdb711d New metadata for Jira for KG (#4785)
* new metadata components

* nits & tests
2025-06-03 20:12:56 +00:00
Chris Weaver
a599176bbf Improve reasoning detection (#4817)
* Improve reasoning detection

* Address greptile comments

* Fix mypy
2025-06-03 20:01:12 +00:00
rkuo-danswer
e0341b4c8a bumping docker push action version (#4816)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-06-03 12:47:01 -07:00
CaptainJeff
4c93fd448f fix: updating gemini models (#4806)
Co-authored-by: Jeffrey Drakos <jeffreydrakos@Jeffreys-MacBook-Pro-2.local>
2025-06-03 11:16:42 -07:00
Chris Weaver
84d916e210 Fix hard delete of agentic chats (#4803)
* Fix hard delete of agentic chats

* Update backend/tests/integration/tests/chat/test_chat_deletion.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Address Greptile comments

* fix tests

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-06-03 11:14:11 -07:00
Weves
f57ed2a8dd Adjust script 2025-06-02 18:39:00 -07:00
trial-danswer
713889babf [Connectors][Script] Resume Paused Connectors (#4798)
* [Connectors][Script] Resume Paused Connectors

* Addressing comment
2025-06-02 18:34:00 -07:00
Weves
58c641d8ec Remove ordering-only flow 2025-06-02 18:29:42 -07:00
Weves
94985e24c6 Adjust user file access 2025-06-02 17:28:49 -07:00
Evan Lohn
4c71a5f5ff drive perm sync logs + misc deployment improvements (#4788)
* some logs

* give postgres more memory

* give postgres more memory

* give postgres more memory

* revert

* give postgres more memory

* bump external access limit

* vespa timeout

* deployment consistency

* bump vespa version

* skip upgrade check

* retry permission by ids

* logs

* fix temp docx file issue

* fix drive file deduping

* RK comments

* mypy

* aggregate logs
2025-06-01 23:36:57 +00:00
rkuo-danswer
b19e3a500b try fixing slack bot (#4792)
* try fixing slack bot

* add logging

* just use if

* safe msg get

* .close isn't async

* enforce block list size limit

* various fixes and notes

* don't use self

* switch to punkt_tab

* fix return condition

* synchronize waiting, use non thread local redis locks

* fix log format, make collection copy more explicit for readability

* fix some logging

* unnecessary function

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-31 00:39:14 +00:00
Chris Weaver
267fe027f5 Fix failed docs table (#4800)
* Fix initial LLM provider set up

* Fix IndexAttemptErrorsModal pagination
2025-05-30 22:19:52 +00:00
Evan Lohn
0d4d8c0d64 jira daylight savings handling (#4797) 2025-05-30 19:13:38 +00:00
Chris Weaver
6f9d8c0cff Simplify passing in of file IDs for filtering (#4791)
* Simplify passing in of file IDs for filtering

* Address RK comments
2025-05-30 05:08:21 +00:00
Weves
5031096a2b Fix frozen add token rate limit migration 2025-05-29 22:22:36 -07:00
rkuo-danswer
797e113000 add a comment (#4789)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-29 14:11:19 -07:00
Raunak Bhagat
edc2892785 fix: Remove "Refining Answer" popup (#4783)
* Clean up logic

* Remove dead code

* Remove "Refining Answer" prompt
2025-05-29 19:55:38 +00:00
rkuo-danswer
ef4d5dcec3 new slack rate limiting approach (#4779)
* fix slack rate limit retry handler for groups

* trying to mitigate memory usage during csv download

* Revert "trying to mitigate memory usage during csv download"

This reverts commit 48262eacf6.

* integrated approach to rate limiting

* code review

* try no redis setting

* add pytest-dotenv

* add more debugging

* added comments

* add more stats

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-29 19:49:32 +00:00
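For context on the "integrated approach to rate limiting" above: Slack's Web API signals throttling via 429 responses with a Retry-After header, and slack_sdk ships a built-in retry handler for exactly this case. A minimal sketch of that mechanism (the token and retry count are placeholders; this is not necessarily how the connector wires it up):

```python
from slack_sdk import WebClient
from slack_sdk.http_retry.builtin_handlers import RateLimitErrorRetryHandler

# Retry 429 (rate-limited) responses after the server-provided Retry-After
# interval instead of failing the connector run outright.
client = WebClient(token="xoxb-placeholder")
client.retry_handlers.append(RateLimitErrorRetryHandler(max_retry_count=3))

# Subsequent calls such as client.conversations_list(...) now back off
# automatically when Slack rate-limits the workspace token.
```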
Evan Lohn
0b5e3e5ee4 skip excel files that openpyxl fails on (#4787) 2025-05-29 18:09:46 +00:00
SubashMohan
f5afb3621e connector filter bug fix (#4771)
* connector filter bug fix

* refactor: use ValidStatuses type for last status filter

---------

Co-authored-by: Subash <subash@onyx.app>
2025-05-29 15:17:04 +00:00
rkuo-danswer
9f72826143 Bugfix/slack bot debugging (#4782)
* adding some logging

* better var name

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-28 18:43:11 +00:00
rkuo-danswer
ab7a4184df Feature/helm k8s probes 2 (#4766)
* add probes

* lint fixes

* add beat probes

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-28 05:20:24 +00:00
rkuo-danswer
16a14bac89 Feature/tenant reporting 2 (#4750)
* add more info

* fix headers

* add filename as param (merge)

* db manager entry in launch template

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-27 23:24:47 +00:00
Raunak Bhagat
baaf31513c fix: Create new grouping for CRM connectors (#4776)
* Create new grouping for CRM connectors
* Edit spacing
2025-05-27 06:51:34 -07:00
Rei Meguro
0b01d7f848 refactor: stream_llm_answer (#4772)
* refactor: stream_llm_answer

* fix: lambda

* fix: mypy, docstring
2025-05-26 22:29:33 +00:00
rkuo-danswer
23ff3476bc print sanitized api key to help troubleshoot (#4764)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-24 22:27:37 +00:00
Chris Weaver
0c7ba8e2ac Fix/add back search with files (#4767)
* Allow search w/ user files

* more

* More

* Fix

* Improve prompt

* Combine user files + regular uploaded files
2025-05-24 15:44:39 -07:00
Evan Lohn
dad99cbec7 v1 refresh drive creds during perm sync (#4768) 2025-05-23 23:01:26 +00:00
Chris Weaver
3e78c2f087 Fix POSTGRES_IDLE_SESSIONS_TIMEOUT (#4765) 2025-05-23 14:55:23 -07:00
rkuo-danswer
e822afdcfa add probes (#4762)
* add probes

* lint fixes

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-23 02:24:54 +00:00
rkuo-danswer
b824951c89 add probe signals for beat (#4760)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-23 01:41:11 +00:00
Evan Lohn
ca20e527fc fix tool calling for bedrock claude models (#4761)
* fix tool calling for bedrock claude models

* unit test

* fix unit test
2025-05-23 01:13:18 +00:00
rkuo-danswer
c8e65cce1e add k8s probes (#4752)
* add file signals to celery workers

* improve probe script

* cancel tref

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-22 20:21:59 +00:00
rkuo-danswer
6c349687da improve impersonation logging slightly (#4758)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-22 11:17:27 -07:00
Raunak Bhagat
3b64793d4b Update listener passing (#4751) 2025-05-22 01:31:20 +00:00
Rei Meguro
9dbe12cea8 Feat: Search Eval Testing Overhaul (provide ground truth, categorize query, etc.) (#4739)
* fix: autoflake & import order

* docs: readme

* fix: mypy

* feat: eval

* docs: readme

* fix: oops forgot to remove comment

* fix: typo

* fix: rename var

* updated default config

* fix: config issue

* oops

* fix: black

* fix: eval and config

* feat: non tool calling query mod
2025-05-21 19:25:10 +00:00
rkuo-danswer
e78637d632 mitigate memory usage during csv download (#4745)
* mitigate memory usage during csv download

* more gc tweaks

* missed some small changes

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-21 00:44:27 +00:00
Evan Lohn
cac03c07f7 v1 answer refactor (#4721)
* v1 answer refactor

* fix tests

* good catch, tests

* more cleanup
2025-05-20 23:34:27 +00:00
Raunak Bhagat
95dabfaa18 fix: Add back Teams' replies processing (#4744)
* Add replies to document construction and edit tests

* Update tests

* Add replies processing to teams

* Fix test

* Add try-except block around potential failure

* Update entity-id during ConnectorFailure raise
2025-05-20 22:55:28 +00:00
rkuo-danswer
e92c418e0f Feature/openapi (#4710)
* starting openapi support

* fix app / app_fn

* send gitignore

* dedupe function names

* add readme

* modify gitignore

* update launch template

* fix unused path param

* fix mypy

* local tests pass

* first pass at making integration tests work

* fixes

* fix script path

* set python path

* try full path

* fix output dir

* fix integration test

* more build fixes

* add generated directory

* use the config

* add a comment

* add

* modify tsconfig.json

* fix index linting bugs

* tons of lint fixes

* new gitignore

* remove generated dir

* add tasks template

* check for undefined explicitly

* fix hooks.ts

* refactor destructureValue

* improve readme

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-20 21:33:18 +00:00
Chris Weaver
0593d045bf Fix ext_perm_user sign-up for non multi-tenant (#4743)
* OAuth w/ external user fix

* Apply to basic auth as well
2025-05-20 20:17:01 +00:00
rkuo-danswer
fff701b0bb fix slack rate limit retry handler for groups (#4742)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-20 18:54:27 +00:00
rkuo-danswer
0087a32d8b database isn't a var! (#4741)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-20 11:02:53 -07:00
rkuo-danswer
06312e485c make sure the permission client uses the proper retry handler (#4737)
* make sure the permission client uses the proper retry handler

* fix client

---------

Co-authored-by: Richard Kuo <rkuo@rkuo.com>
2025-05-19 21:07:00 -07:00
Evan Lohn
e0f5b95cfc full drive perm sync 2025-05-19 21:06:43 -07:00
Chris Weaver
10bc072b4b Improve drive group sync (#4736)
* Improve drive group sync

* Fix mypy
2025-05-20 02:39:27 +00:00
Evan Lohn
b60884d3af don't fail on fake files (#4735)
* don't fail on fake files

* solve at the source

* oops

* oops2
2025-05-19 23:09:34 +00:00
Chris Weaver
95ae6d300c Fix slack bot kubernetes template (#4734)
* Fix slack path for kubernetes files

* Add env variables
2025-05-19 21:25:57 +00:00
Evan Lohn
b76e4754bf anthropic fix (#4733)
* anthropic fix

* naming
2025-05-19 20:34:29 +00:00
rkuo-danswer
b1085039ca fix nltk punkt (#4732)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-19 20:29:02 +00:00
Rei Meguro
d64f479c9f feat: error handling & optimization (#4722) 2025-05-19 20:27:22 +00:00
Raunak Bhagat
fd735c9a3f perf: Change query-exporting to use generators instead of expanding fully into memory (#4729)
* Change query-exporting to use generators instead of expanding fully into memory

* Fix pagination logic

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Add type annotation

* Add early break if list of chat_sessions is empty

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-05-19 20:09:45 +00:00
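As a rough sketch of the generator-based export described in this commit (names are illustrative, not the repository's actual helpers), the idea is to emit the CSV a row at a time so the full query history never has to be held in memory:

```python
import csv
import io
from collections.abc import Iterator


def stream_query_history_csv(rows: Iterator[dict]) -> Iterator[str]:
    """Yield CSV text chunk by chunk instead of building the whole file."""
    buffer = io.StringIO()
    writer: csv.DictWriter | None = None
    for row in rows:
        if writer is None:
            writer = csv.DictWriter(buffer, fieldnames=list(row.keys()))
            writer.writeheader()
        writer.writerow(row)
        yield buffer.getvalue()
        # Reset the buffer so each yielded chunk contains only new data.
        buffer.seek(0)
        buffer.truncate(0)
```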
rkuo-danswer
2282f6a42e fix restart (#4726)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-17 17:02:27 +00:00
Rei Meguro
0262002883 fix: continue button (#4724) 2025-05-16 15:43:37 -07:00
Chris Weaver
01ca9dc85d Fix OAuth w/ ext_perm_user for multi-tenant (#4723)
* Fix OAuth w/ ext_perm_user for multi-tenant

* Improve comment
2025-05-16 14:44:01 -07:00
Weves
0735a98284 Fix import ordering 2025-05-16 14:43:50 -07:00
Emerson Gomes
8d2e170fc4 Use LiteLLM DB for determining model tool capability (#4698)
* Bump LiteLLM

* Use LiteLLM DB for determining model tool capability instead of using hardcoded list

* Make function defaults explicit
2025-05-16 13:31:39 -07:00
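The change above swaps a hardcoded model list for LiteLLM's own model database; a minimal sketch of what that lookup can look like, assuming LiteLLM's public supports_function_calling helper (the error handling is illustrative):

```python
import litellm


def model_supports_tools(model_name: str) -> bool:
    # Consult LiteLLM's bundled model map for tool/function-calling support
    # rather than maintaining a hardcoded allowlist.
    try:
        return litellm.supports_function_calling(model=model_name)
    except Exception:
        # Treat models LiteLLM does not know about as lacking tool support.
        return False
```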
SubashMohan
f3e2795e69 Highlight active link in AdminSidebar based on current pathname (#4719)
* Highlight active link in AdminSidebar based on current pathname

* Refactor AdminSidebar to declare pathname variable earlier

---------

Co-authored-by: Subash <subash@onyx.app>
2025-05-16 04:55:28 +00:00
Rei Meguro
30d9ce1310 feat: search quality eval (#4720)
* fix: import order

* test examples

* fix: import

* wip: reranker based eval

* fix: import order

* feat: adjusted score

* fix: mypy

* fix: suggestions

* sorry csv, you must go

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix: mypy

* fix: suggestions

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-05-15 23:44:33 +00:00
Evan Lohn
2af2b7f130 fix connector tests and drive indexing (#4715)
* fix connector tests and drive indexing

* fix other test

* fix checkpoint data bug
2025-05-15 19:15:46 +00:00
SubashMohan
9d41820363 UI fixes (#4709)
Co-authored-by: Subash <subash@onyx.app>
2025-05-15 05:46:51 +00:00
rkuo-danswer
a44f289aed restructure to signal activity while processing (#4712)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-15 05:23:11 +00:00
SubashMohan
9c078b3acf Implement pagination for retrieving spots in HighspotClient (#4705)
Co-authored-by: Subash <subash@onyx.app>
2025-05-15 00:32:12 +00:00
Rei Meguro
349f2c6ed6 Bugfix/usage report UUID (#4703)
* feat: replace user id with username in user report

* feat: pagelink arrow disable

* fix: import order

* fix: removed things we're not doing
2025-05-14 22:27:01 +00:00
rkuo-danswer
0dc851a1cf use existing session user if it matches the email (#4706)
Co-authored-by: Richard Kuo <rkuo@rkuo.com>
2025-05-14 22:18:56 +00:00
Raunak Bhagat
f27fe068e8 Add env variables (#4711) 2025-05-14 19:39:29 +00:00
Evan Lohn
f836cff935 reset to prs on next checkpoint (#4704)
* reset to prs on next checkpoint

* github time fix
2025-05-14 18:47:38 +00:00
Raunak Bhagat
312e3b92bc perf: Implement checkpointing for Teams Connector. (#4601)
* Add basic foundation for teams checkpointing classes

* Fix slack connector main entrypoint

* Saving changes

* Finish teams checkpointing impl

* Remove commented out code

* Remove more unused code

* Move code around

* Add threadpool to process requests in parallel

* Fix mypy errors / warnings

* Move test import to main function only

* Address nits on PR

* Remove unnecessary check prior to entering while-loop

* Remove print statement

* Change exception message

* Address more nits

* Use indexing instead of destructuring

* Add back invocation of `run_with_timeout` instead of a direct call

* Revert slack testing code

* Move early return to before second API call

* Pull fetch to team outside of loop

* Address nits on PR

* Add back client-side filtering

* Updated connector to return after a team's indexing is finished

* Add type ignore

* Implement proper datetime range fetching

* Address comment on PR

* Rename function

* Change exception type when no team with the given id was found

* Address nit on PR

* Add comment on why `page_loaded` is needed to be specified explicitly

* Remove duplicated calls to fetching channels

* Use helper function for thread-based yielding instead of manual logic

* Move datetime filtering to message-level instead

* Address more comments on PR

* Add new utility function for yielding sections

* Add additional utility function

* Add teams tests

* Edit error message

* Address nits on PR

* Promote url-prefix to be a class level constant

* Fix mypy error

* Remove start/end parameters from function that doesn't use them anymore; move around comments

* Address more nits on PR

* Add comment
2025-05-14 04:30:57 +00:00
Evan Lohn
0cc0964231 Perf/drive finer checkpoints (#4702)
* celery and drive fixes

* some initial nits

* skip weird files

* safer extension check

* fix drive
2025-05-14 03:15:29 +00:00
Chris Weaver
b82278e685 Fix heavy import (#4701) 2025-05-13 23:04:16 +00:00
Richard Kuo (Onyx)
daa1746b4a just readme fixes 2025-05-13 09:56:07 -07:00
rkuo-danswer
d8068f0a68 Feature/helm separate workers (#4679)
* add test

* try breaking out background workers

* fix helm lint complaints

* rename disabled files more

* try different folder structure

* fix beat selector

* vespa setup should break on success

* improved instructions for basic helm chart testing

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-13 02:23:32 +00:00
Chris Weaver
d91f776c2d Fix initial checkpoint save (#4697)
* Fix initial checkpoint save

* Improve comment

* Another small fix
2025-05-13 01:59:07 +00:00
Chris Weaver
a01135581f Small GitHub enhancements (#4696)
* Small github enhancements

* Fix manual run

* Address EL comments
2025-05-13 01:14:16 +00:00
rkuo-danswer
392b87fb4f Bugfix/limit permission size (#4695)
* add utility function

* add utility functions to DocExternalAccess

* refactor db access out of individual celery tasks and put it directly into the heavy task

* code review and remove leftovers

* fix circular imports

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-13 00:46:31 +00:00
Evan Lohn
551a05aef0 light worker discovers beat task (#4694)
* light worker discovers beat task

* v2: put in right place
2025-05-12 21:20:18 +00:00
rkuo-danswer
6b9d0b5af9 ensure we don't tag 'latest' with cloud images (#4688)
* ensure we don't tag 'latest' with cloud images

* add docker login to trivy

* fix tag names

* flavor latest false (no auto latest tags)

* fix typo

* only run the appropriate workflow for web

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
Co-authored-by: Richard Kuo <rkuo@rkuo.com>
2025-05-12 17:23:01 +00:00
Chris Weaver
b8f3ad3e5d Fix/remove ee fe (#4690)
* Remove ee imports from FE

* Remove ee imports from FE

* Style
2025-05-12 02:31:04 +00:00
Chris Weaver
b19515e25d Fix window_start (#4689)
* Fix window_start

* Add comment
2025-05-12 00:11:20 +00:00
Chris Weaver
913f7cc7d4 Fix/remove ee from mit (#4682)
* Remove some ee imports

* more

* Remove all ee imports

* Fix

* Autodiscover

* fix

* Fix typing

* More celery task stuff

* Fix import
2025-05-11 22:09:50 +00:00
rkuo-danswer
84566debab set field size limit (#4683)
* set field size limit

* don't use sys.maxsize

---------

Co-authored-by: Richard Kuo <rkuo@rkuo.com>
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-09 22:46:13 +00:00
rkuo-danswer
1a8b7abd00 add test (#4676)
* add test

* comment

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-09 21:38:51 +00:00
Evan Lohn
4c0423f27b fix github cursor pagination infinite loop (#4673)
* fix infinite loop

* unit test for infinite loop issue

* mypy version

* more logging

* unbound locals
2025-05-09 21:35:37 +00:00
rkuo-danswer
7965fd9cbb run testing (#4681)
* run testing

* need to break on success

* add a readme

* raise vespa to 6GB

* allow test to retry

* add 20 attempts

* put memory limits back to normal

* restore chart testing on changes only

* increase retries to 40

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-09 11:49:43 -07:00
Chris Weaver
91831f4d07 Fix user count (#4677)
* Fix user count

* Add helper + fix async function as well

* fix mypy

* Address RK comment
2025-05-08 17:19:40 -07:00
Chris Weaver
1dd98a87cc Try to reduce memory usage on group sync (#4678) 2025-05-08 22:53:53 +00:00
rkuo-danswer
0dd65cc839 enterprise settings needs to 403 on tenant id absence (#4675)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-08 18:32:12 +00:00
Chris Weaver
519aeb6a1f Drive perm sync enhancement (#4672)
* Enhance drive perm sync

* add tests

* more stuff

* fixes

* Fix

* Speed up

* Add missing file

* Address EL comments

* Add ondelete=CASCADE

* Improve comment
2025-05-08 03:12:41 +00:00
Evan Lohn
0eab6ab935 fix drive slowness (#4668)
* fix slowness

* no more silent failing for users

* nits

* no silly info transfer
2025-05-07 22:48:08 +00:00
Evan Lohn
ee09cb95af fixes foreign key violation (#4670)
* fixes foreign key violation

* nit
2025-05-07 18:27:32 +00:00
Evan Lohn
8a9a66947e make 404s skippable (#4671) 2025-05-07 18:04:35 +00:00
Raunak Bhagat
d744c0dab4 fix: Fix error in which channel names would not have the leading "#" removed (#4664)
* Fix failing entrypoint into slack connector

* Pre-filter channel names upon instantiation of slack connector class

* Add decrypt script

* Add slack connector tests

* Fix mypy errors on decrypt.py

* Add property to SlackConnector class

* Add some basic tests

* Move location of tests

* Change name of env token

* Add secrets for Slack

* Add more parameterized cases

* Change env variable name

* Change names

* Update channel names

* Edit tests

* Modify tests

* Only import type in __main__

* Fix tests to actually test connectors

* Pass parameter to fixture directly
2025-05-07 04:55:21 +00:00
Chris Weaver
70df685709 Non default schema fix (#4667)
* Use correct postgres schema

* Remove raw Session() use

* Refactor + add test

* Fix comment
2025-05-06 20:35:59 -07:00
Chris Weaver
f85ef78238 Add more logging for confluence perm-sync + handle case where permiss… (#4586)
* Add more logging for confluence perm-sync + handle case where permissions are removed from the access token

* Make required permissions explicit

* more

* Add slim fetch limit + mark all cc pairs of source type as successful upon group sync

* Add to dev compose

* Small teams fix

* Add file

* Add single limit pagination for confluence

* Restrict to server only

* more logging

* cleanup

* Cleanup

* Remove CONFLUENCE_CONNECTOR_SLIM_FETCH_LIMIT

* Handle teams error

* Fix ut

* Remove db dependency from confluence_doc_sync

* move stuff back to debug
2025-05-06 18:35:14 +00:00
Evan Lohn
2d7e48d8e8 possible mangling fix (#4666)
* possible mangling fix

* fixed nextUrl setting

* global bad
2025-05-06 15:51:39 +00:00
rkuo-danswer
8231328dc6 restore caching and fix up some prefixing (#4649)
* restore caching and fix up some prefixing

* try backend matrix build and fix artifact names

* need id

* add backslashes to be consistent

* fix no-cache

* leave docker tags to the meta action

* need checkout in merge

* add comment

* move spammy logs to debug status

* bunch of no-cache updates

* prefix

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-05 16:43:29 +00:00
Chris Weaver
7763e2fa23 Fix non-default schema in KV store (#4655)
* Fix non-default schema in KV store

* Fix custom schema
2025-05-04 22:19:35 +00:00
Chris Weaver
6085bff12d Fix test / display models (#4657)
* Fix test / display models

* Address greptile comments

* Increase wait time

* Increase overall timeout

* Move stuff to utils file

* Updates
2025-05-04 14:04:03 -07:00
Weves
97d60a89ae Add LRU cache to get_model_map 2025-05-03 17:43:58 -07:00
Raunak Bhagat
79b981075e perf: Optimize query history exporting process (#4602)
* Update mode to be a default parameter in `FileStore.read`

* Move query history exporting process to be a background job instead

* Move hardcoded report-file-naming to a common utility function

* Add type annotations

* Update download component

* Implement button to re-ping and download CSV file; fix up some backend file-checking logic

* De-indent logic (w/ early return)

* Return different error codes depending on the type of task status

* Add more resistant failure retrying mechanisms

* Remove default parameter in helper function

* Use popup for error messaging

* Update return code

* Update web/src/app/ee/admin/performance/query-history/DownloadAsCSV.tsx

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Add type to useState call

* Update backend/ee/onyx/server/query_history/api.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update backend/onyx/file_store/file_store.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update backend/ee/onyx/background/celery/apps/primary.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Move rerender call to after check

* Run formatter

* Add type conversions back (smh greptile)

* Remove duplicated call to save_file

* Move non-fallible logic out of try-except block

* Pass date-ranges into API call

* Convert to ISO strings before passing it into the API call

* Add API to list all tasks

* Create new pydantic model to represent tasks to return instead

* Change helper to only fetch query-history tasks

* Use `shared_tasks` instead of old method

* Address more comments from PR; consolidate how task name is generated

* Mark task as failed if any exception is raised

* Change the task object which is returned back to the FE

* Add a table to display previously generated query-history-csv's

* Add timestamps to task; delete tasks as soon as file finishes processing

* Raise exception if start_time is not present

* Convert hard-coded string to constant

* Add "Generated At" field to table

* Return task list in sorted order (based off of start-time)

* Implement pagination

* Remove unused props and cleanup tailwind classes

* Change the name of kickoff button

* Redesign how previous query exports are viewed

* Make button a constant width even when contents change

* Remove timezone information before comparing

* Decrease interval time for re-pinging API

* Add timezone to start-time creation

* Add a refreshInterval for getting updated task status

* Add new background queue

* Edit small verbiage and remove error popup when max-retries is hit

* Change up heavy worker to recognize new task in new module

* Ensure `celery_app` is imported

* Change how `celery_app` is imported and defined

* Update comment on why `celery_app` must be imported

* Add basic skeleton for new beat task to cleanup any dead / failed query-history-export tasks

* Move cleanup task to different worker / queue

* Implement cleanup task

* Add return type

* Address comment on PR

* Remove delimiter from prefix

* Change name of function to be more descriptive

* Remove delimiter from prefix constant

* Move function invocation closer to usage location

* Move imports to top of file

* Move variable up a scope due to undefined error

* Remove dangling if-statement

* Make function more pure-functional

* Remove redefinition

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-05-03 00:16:35 +00:00
Evan Lohn
113876b276 id not set in checkpoint FINAL (#4656)
* it will never happen again.

* fix perm sync issue

* fix perm sync issue2

* ensure member emails map is populated

* other fix for perm sync

* address CW comments

* nit
2025-05-03 00:10:21 +00:00
rkuo-danswer
5c3820b39f Bugfix/slack timeout (#4652)
* don't log all channels

* print number of channels

* sanitize indexing exception messages

* harden vespa index swap

* use constants and fix list generation

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-02 18:24:45 +00:00
Evan Lohn
55e4465782 orphan tag cleanup optimization (#4651)
* move orphan tag cleanup to final cleanup section of associated parent tasks

* naming
2025-05-02 17:22:59 +00:00
Evan Lohn
6d9693dc51 drive file deduping (#4648)
* drive file deduping

* switched to version that does not require thread safety

* thanks greptile

* CW comments
2025-05-02 10:58:16 -07:00
Weves
75fa10cead fix highspot 2025-05-01 14:34:35 -07:00
Richard Kuo (Onyx)
0497bfdf78 fix double entry 2025-05-01 11:26:57 -07:00
rkuo-danswer
0db2ad2132 memory optimize task generation for connector deletion (#4645)
* memory optimize task generation for connector deletion

* test

* fix up integration test docker file

* more no-cache

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-05-01 10:47:26 -07:00
Chris Weaver
49cd38fb2d Update README.md 2025-05-01 09:58:33 -07:00
Raunak Bhagat
bd36b2ad6d Remove cursor-help for tooltip (#4643) 2025-05-01 09:38:13 -07:00
Evan Lohn
6436b60763 github cursor pagination (#4642)
* v1 of cursor pagination

* mypy

* unit tests

* CW comments
2025-04-30 19:09:20 -07:00
Raunak Bhagat
a6cc1c84dc Add padding to bottom of pages (#4641) 2025-04-30 22:34:18 +00:00
Weves
8515f4b57a Highspot cleanup 2025-04-30 14:57:36 -07:00
joachim-danswer
f68b74ff4a disable Agent Search refinement by default (#4638)
- created env variable AGENT_ALLOW_REFINEMENT with default "". Must be set to true to enable Refinement.
 - added an environment variable for the upper limit of docs that can be sent to verification
2025-04-30 19:51:08 +00:00
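A minimal sketch of the flag described above: AGENT_ALLOW_REFINEMENT comes from the commit itself, while the second variable name is hypothetical and only illustrates the "upper limit of docs sent to verification" setting:

```python
import os

# Refinement stays off unless the env var is explicitly set to "true".
AGENT_ALLOW_REFINEMENT = (
    os.environ.get("AGENT_ALLOW_REFINEMENT", "").lower() == "true"
)

# Hypothetical name for the verification-doc cap mentioned in the commit;
# the default of 50 is illustrative, not the project's actual value.
AGENT_MAX_VERIFICATION_DOCS = int(
    os.environ.get("AGENT_MAX_VERIFICATION_DOCS", "50")
)
```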
rkuo-danswer
e254fdc066 add sendgrid as option (#4639)
* add sendgrid as option

* code review

* mypy

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-30 07:33:15 +00:00
Raunak Bhagat
f26de37878 Remove info hoverable (#4637) 2025-04-30 03:16:14 +00:00
rkuo-danswer
94de23fe87 Bugfix/chat images 2 (#4630)
* don't hardcode -1

* extra spaces

* fix binary data in blurb

* add note to binary handling

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-30 01:29:10 +00:00
Chris Weaver
dd242c9926 Fix race condition with archived channels (#4635) 2025-04-29 23:35:40 +00:00
Chris Weaver
47767c1666 Small improvements to checkpoint pickup logic (#4634) 2025-04-29 21:23:54 +00:00
Chris Weaver
9be3da2357 Fix gitlab (#4629)
* Fix gitlab

* Add back assert
2025-04-28 17:42:49 -07:00
Evan Lohn
8961a3cc72 page token for drive group sync (#4627) 2025-04-28 19:06:06 +00:00
Chris Weaver
47b9e7aa62 Fix teams (#4628)
* Fix teams

* Use get_all

* Add comment
2025-04-28 11:53:22 -07:00
Evan Lohn
eebfa5be18 Confluence server api time fix (#4589)
* tolerance of confluence api weirdness

* remove checkpointing

* remove skipping logic from checkpointing

* add back checkpointing

* switch confluence checkpointing to be based on page starts

* address CW comments and fix unit tests

* some mitigations of bad confluence api

* new checkpointing approach and testing fixes

* fix test

* CW comments
2025-04-28 06:06:29 +00:00
Chris Weaver
5047d256b4 Add support for restrictions w/o any access (#4624)
* Add support for restrictions w/o any access

* Fix
2025-04-28 03:09:36 +00:00
Evan Lohn
5db676967f no more duplicate files during folder indexing (#4579)
* no more duplicate files during folder indexing

* cleanup checkpoint after a shared folder has been finished

* cleanup

* lint
2025-04-28 01:01:20 +00:00
Chris Weaver
ea0664e203 Fix LLM API key (#4623)
* Fix LLM API key

* Remove unused import

* Update web/src/app/admin/configuration/llm/LLMProviderUpdateForm.tsx

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-04-27 23:10:36 +00:00
Weves
bbd0874200 fix 2025-04-27 14:37:53 -07:00
rkuo-danswer
c6d100b415 Bugfix/chat images (#4618)
* keep chatfiletype as image instead of user_knowledge

* improve continue message

* fix to image handling

* greptile code review

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-27 20:34:30 +00:00
Evan Lohn
5ca7a7def9 fix migration and add test (#4615) 2025-04-25 21:27:59 +00:00
Chris Weaver
92b5e1adf4 Add support for overriding user list (#4616)
* Add support for overriding user list

* Fix

* Add typing

* pythonify
2025-04-25 15:15:23 -07:00
Chris Weaver
23c6e0f3bf Single source of truth for image capability (#4612)
* Single source of truth for image capability

* Update web/src/app/admin/assistants/AssistantEditor.tsx

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Fix tests

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-04-25 20:37:16 +00:00
Chris Weaver
ad76e6ac9e Adjust confluence perm sync frequency (#4613)
* Adjust confluence perm sync frequency

* Fix comment
2025-04-25 19:36:10 +00:00
Evan Lohn
151aabea73 specific user emails for drive connector (#4608)
* specific user emails for drive connector

* fix drive connector tests

* fix connector tests
2025-04-25 18:49:20 +00:00
Chris Weaver
d711680069 Add e2e test for assistant creation/edit (#4597)
* Add e2e test for assistant creation/edit

* Skip initial full reset to have seeded connector
2025-04-25 13:21:34 -07:00
Evan Lohn
9835d55ecb transfer old fields to new config 2025-04-25 12:25:20 -07:00
Raunak Bhagat
69c539df6e fix: Create migration to re-introduce display_model_names (#4600)
* Fix migration

* Fix migration to take care of various nullability cases

* Address comments on PR

* Rename variables to be more descriptive

* Make helpers private

* Fix select statement

* Add comments to explain the involved logic

* Saving changes

* Finish script to revalidate `display_model_names`

* Address comments on PR by greptile

* Add missing columns

* Pull difference operator out into binding

* Add deletion prior to re-insertion

* Use map from shared llm-provider file instead

* Use helper function instead of copying code

* Remove delete and convert into an update statement

* Use pydantic for ModelConfigurations

* Update to do nothing on-conflict rather than update

* Address nits on PR

* Add default visible model(s) for bedrock

* Perform an update on conflict instead of doing nothing
2025-04-25 10:44:13 -07:00
pablonyx
df67ca18d8 My docs cleanup (#4519)
* update

* improved my docs

* nit

* nit

* k

* push changes

* update

* looking good

* k

* fix preprocessing

* try a fix

* k

* update

* nit

* k

* quick nits

* Cleanup / fixes

* Fixes

* Fix build

* fix

* fix quality checks

---------

Co-authored-by: Weves <chrisweaver101@gmail.com>
2025-04-25 05:20:33 +00:00
Chris Weaver
115cfb6ae9 Fix tool choice (#4596)
* Fix tool choice

* fix
2025-04-24 21:51:14 -07:00
rkuo-danswer
672f3a1c34 fix provisioning and don't spawn tasks which could result in a race condition (#4604)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-25 02:41:05 +00:00
Raunak Bhagat
13b71f559f fix: Fix migration issue in which display-model-names were not being appropriately set (#4594)
* Fix migration

* Fix migration to take care of various nullability cases

* Address comments on PR

* Rename variables to be more descriptive

* Make helpers private

* Fix select statement

* Add comments to explain the involved logic

* Add helpers for viewing visible model names

* Fix logic for missing model + display-model names in migration
2025-04-24 21:26:33 +00:00
Evan Lohn
2981b7a425 linear dupe docs fix (#4607) 2025-04-24 21:00:21 +00:00
rkuo-danswer
37adf31a3b fix priority on vespa metadata sync (#4603)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-24 19:50:49 +00:00
Raunak Bhagat
5d59850a17 Fix slack formatting bug (#4587) 2025-04-24 17:18:54 +00:00
Evan Lohn
91d6b739a4 ensure drive id set in checkpoint (#4595)
* ensure drive id set in checkpoint

* asserts gone

* address CW
2025-04-24 01:20:13 +00:00
rkuo-danswer
c83ee06062 Feature/salesforce correctness 2 (#4506)
* refactor salesforce sqlite db access

* more refactoring

* refactor again

* refactor again

* rename object

* add finalizer to ensure db connection is always closed

* avoid unnecessarily nesting connections and commit regularly when possible

* remove db usage from csv download

* dead code

* hide deprecation warning in ddtrace

* remove unused param

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-24 01:05:52 +00:00
Raunak Bhagat
c93cebe1ab fix: Add minor fixes to how model configurations are displayed (#4593)
* Add minor fixes to how model configurations are interacted with

* Remove azure entry
2025-04-23 21:42:02 +00:00
rkuo-danswer
ea1d3c1eda Feature/db script (#4574)
* debug script + slight refactor of db class

* better comments

* move setup logger

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
Co-authored-by: Richard Kuo <rkuo@rkuo.com>
2025-04-23 20:00:35 +00:00
rkuo-danswer
c9a609b7d8 Bugfix/slack bot channel config (#4585)
* friendlier handling of slack channel retrieval

* retry on downgrade_postgres deadlock

* fix comment

* text

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-23 20:00:03 +00:00
rkuo-danswer
07f04e35ec Bugfix/alembic sqlengine (#4592)
* need sqlengine to work

* add comments

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-23 19:21:34 +00:00
joachim-danswer
d8b050026d removal of keyword 1st phase 2025-04-22 20:29:57 -07:00
Raunak Bhagat
c76dc2ea2c fix: Fix the add_model_configuration migration by removing duplicate model-names during insertion (#4588)
* Convert the model_names and display_model_names into a set instead

* Update backend/alembic/versions/7a70b7664e37_add_model_configuration_table.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-04-22 18:59:31 -07:00
rkuo-danswer
5e11c635d9 wrong logger imported in a lot of wrong places (#4582)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-22 23:34:02 +00:00
joachim-danswer
669b668463 updated logging and basic search expansion procedure 2025-04-22 11:58:02 -07:00
Raunak Bhagat
85fa083717 fix: Return default value instead of throwing error (#4575)
* Return default value instead of throwing error

* Add default parameter

* Move logic around

* Use dummy value for max_input_tokens in testing flow

* Remove unnecessary assignment
2025-04-22 17:33:36 +00:00
Chris Weaver
420d2614d4 Fix assistants forms (#4578)
* Fix assistant num chunk setting

* test

* Fix test

* Update web/src/app/assistants/mine/AssistantModal.tsx

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update web/src/app/assistants/mine/AssistantModal.tsx

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-04-22 16:09:03 +00:00
Raunak Bhagat
e3218d358d feat: Add assistant name to UI (#4569)
* Add assistant name to UI

* Fix tailwind styling class
2025-04-22 09:35:35 -07:00
Weves
ae632b5fab Fix missing Connector Configuration 2025-04-21 18:50:25 -07:00
rkuo-danswer
0d4c600852 out of process retry for multitenant test reset (#4566)
* tool to generate vespa schema variations for our cloud

* extraneous assign

* use a real templating system instead of search/replace

* fix float

* maybe this should be double

* remove redundant var

* template the other files

* try a spawned process

* move the wrapper

* fix args

* increase timeout

* run multitenant reset operations out of process as well

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
Co-authored-by: Richard Kuo <rkuo@rkuo.com>
2025-04-21 23:30:18 +00:00
Evan Lohn
eb569bf79d add emails to retry with on 403 (#4565)
* add emails to retry with on 403

* attempted fix for connector test

* CW comments

* connector test fix

* test fixes and continue on 403

* fix tests

* fix tests

* fix concurrency tests

* fix integration tests with llmprovider eager loading
2025-04-21 23:27:31 +00:00
Chris Weaver
f3d5303d93 Fix slack bot feedback (#4573)
* Fix slack bot feedback

* Fix

* Make safe
2025-04-21 15:54:48 -07:00
Raunak Bhagat
b97628070e feat: Add ability to specify max input token limit for custom LLM providers (#4510)
* Add multi text array field

* Add multiple values to model configuration for a custom LLM provider

* Fix reference to old field name

* Add migration

* Update all instances of model_names / display_model_names to use new schema migration

* Update background task

* Update endpoints to not throw errors

* Add test

* Update backend/alembic/versions/7a70b7664e37_add_models_configuration_table.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update backend/onyx/background/celery/tasks/llm_model_update/tasks.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Fix list comprehension nits

* Update web/src/components/admin/connectors/Field.tsx

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update web/src/app/admin/configuration/llm/interfaces.ts

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Implement greptile recommendations

* Update backend/onyx/db/llm.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update backend/onyx/server/manage/llm/api.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update backend/onyx/background/celery/tasks/llm_model_update/tasks.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update backend/onyx/db/llm.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Fix more greptile suggestions

* Run formatter again

* Update backend/onyx/db/models.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Add relationship to `LLMProvider` and `ModelConfigurations` classes

* Use sqlalchemy ORM relationships instead of manually populating fields

* Upgrade migration

* Update interface

* Remove all instances of model_names and display_model_names from backend

* Add more tests and fix bugs

* Run prettier

* Add types

* Update migration to perform data transformation

* Ensure native llm providers don't have custom max input tokens

* Start updating frontend logic to support custom max input tokens

* Pass max input tokens to LLM class (to be passed into `litellm.completion` call later)

* Add ModelConfigurationField component for custom llm providers

* Edit spacing and styling of model configuration matrix

* Fix error message displaying bug

* Edit opacity of `FiX` field for first index

* Change opacity back

* Change roundness

* Address comments on PR

* Perform fetching of `max_input_tokens` at the beginning of the callgraph and rope it throughout the entire callstack

* Change `add` to `execute`

* Move `max_input_tokens` into `LLMConfig`

* Fix bug with error messages not being cleared

* Change field used to fetch LLMProvider

* Fix model-configuration UI

* Address comments

* Remove circular import

* Fix failing tests in GH

* Fix failing tests

* Use `isSubset` instead of equality to determine native vs custom LLM Provider

* Remove unused import

* Make responses always display max_input_tokens

* Fix api endpoint to hit

* Update types in web application

* Update object field

* Fix more type errors

* Fix failing llm provider tests

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-04-21 04:30:21 -07:00
pablonyx
72d3a7ff21 Frontend testing (#4500)
* add o3 + o4 mini

* k

* see which ones fail

* attempt

* k

* k

* llm ordering passing

* all tests passing

* quick bump

* Revert "add o3 + o4 mini"

This reverts commit 4cfa1984ec.

* k

* k
2025-04-20 23:29:47 +00:00
rkuo-danswer
2111eccf07 Feature/vespa jinja (#4558)
* tool to generate vespa schema variations for our cloud

* extraneous assign

* use a real templating system instead of search/replace

* fix float

* maybe this should be double

* remove redundant var

* template the other files

* try a spawned process

* move the wrapper

* fix args

* increase timeout

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
Co-authored-by: Richard Kuo <rkuo@rkuo.com>
2025-04-20 22:28:55 +00:00
Chris Weaver
87478c5ca6 Parallelize connector tests (#4563)
* Parallelize connector tests

* Use --dist loadfile

* Add slow test logging
2025-04-19 18:10:50 -07:00
evan-danswer
dc62d83a06 File connector tests (#4561)
* danswer to onyx plus tests for file connector

* actually add test
2025-04-19 15:54:30 -07:00
evan-danswer
5681df9095 address getting attachments forever (#4562)
* address getting attachments forever

* fix unit tests
2025-04-19 15:53:27 -07:00
Chris Weaver
6666300f37 Fix flakey web test (#4551)
* Fix flakey web test

* Increase wait time

* Another attempt to fix

* Simplify + add new test

* Fix web tests
2025-04-19 15:12:11 -07:00
Chris Weaver
7f99c54527 Small improvements to connector UI (#4559)
* Small improvements to connector UI

* Update web/src/app/admin/connector/[ccPairId]/IndexingAttemptsTable.tsx

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Fix last_permission_sync

* Handle cases where a source doesn't need group sync

* fix

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2025-04-19 19:14:05 +00:00
Chris Weaver
4b8ef4b151 Update README.md 2025-04-18 18:29:56 -07:00
rkuo-danswer
e5e0944049 tool to generate vespa schema variations for our cloud (#4556)
* tool to generate vespa schema variations for our cloud

* extraneous assign

* float, not double

* back to double

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-18 20:47:17 +00:00
pablonyx
356336a842 add o3 + o4 mini (#4555) 2025-04-18 20:42:35 +00:00
rkuo-danswer
5bc059881e ping with keep alive (#4550)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-18 18:44:07 +00:00
rkuo-danswer
fa80842afe Bugfix/harden activity timeout (#4545)
* add some hardening

* add info memory logging

* fix last_observed

* remove log spam

* properly cache last activity details

* default values

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-18 02:28:22 +00:00
rkuo-danswer
a8a5a82251 slightly better slack logging (#4554)
Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-17 18:45:48 -07:00
evan-danswer
953a4e3793 v1 file connector with metadata (#4552) 2025-04-17 23:02:34 +00:00
rkuo-danswer
04ebde7838 refactor a mega function for readability and make sure to increment r… (#4542)
* refactor a mega function for readability and make sure to increment retry_count on exception so that we don't infinitely loop

* improve session and page level context handling

* don't use pydantic for the session context

* we don't need retry success

* move playwright handling into the session context

* need to break on ok

* return doc from scrape

* fix comment

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-17 06:43:30 +00:00
Chris Weaver
6df1c6c72f Pull in more fields for Jira (#4547)
* Pull in more fields for Jira

* Fix tests

* Fix

* more fix

* Fix

* Fix S3 test

* fix
2025-04-17 01:52:50 +00:00
Raunak Bhagat
fe94bdf936 fix: Fix duplicate kwarg issue when calling litellm.main.completion (#4533)
* Fix duplicate kwarg issue

* Change how vertex_credentials are passed

* Modify temporary dict instead

* Change string to a global constant

* Add extra condition to if-check during population of map
2025-04-16 19:29:53 -07:00
rkuo-danswer
2a9fd9342e small improvement to checking for image attachments (#4543)
* small improvement to checking for image attachments

* better comments

* check centralized list of types instead of hardcoding them in the connector

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-17 00:34:22 +00:00
pablonyx
597ad806e3 Skip image files for S3 (#4535)
* skip image files

* process images s3

* tests

* k

* update

* nit

* update
2025-04-16 23:41:00 +00:00
evan-danswer
5acae2dc80 fix re-processing of previously seen docs Confluence (#4544)
* fix re-processing of previously seen docs

* performance
2025-04-16 23:16:21 +00:00
pablonyx
99455db26c add 4.1 (#4540) 2025-04-16 15:34:01 -07:00
pablonyx
0d12e96362 Fix bug with saml validation (#4522)
* fix bug with saml validation

* k
2025-04-16 19:35:58 +00:00
Chris Weaver
7e7b6e08ff Fix confluence perm sync ancestry (#4536)
* Fix confluence perm sync ancestry

* Address EL comments

* add test for special case

* remove print

* Fix test
2025-04-16 03:02:54 +00:00
Raunak Bhagat
1dd32ebfce Remove alert upon submission (#4537) 2025-04-15 19:12:12 -07:00
Weves
c3ffaa19a4 Small no-letsencrypt improvement 2025-04-15 18:29:07 -07:00
pablonyx
f4ea7e62a7 Miscellaneous cleanup (#4516)
* stricter typing

* k
2025-04-15 23:35:13 +00:00
rkuo-danswer
2ac41c3719 Feature/celery beat watchdog (#4534)
* upgrade celery to release version

* make the watchdog script more reusable

* use constant

* code review

* catch interrupt

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-15 22:05:37 +00:00
evan-danswer
a8cba7abae extra logging for uncommon permissions cases (#4532)
* extra logging for uncommon permissions cases

* address CW comments
2025-04-15 18:56:17 +00:00
evan-danswer
ae9f8c3071 checkpointed confluence (#4473)
* checkpointed confluence

* confluence checkpointing tested

* fixed integration tests

* attempt to fix connector test flakiness

* fix rebase
2025-04-14 23:59:53 +00:00
evan-danswer
742041d97a fix font for dark mode (#4527) 2025-04-14 22:43:03 +00:00
pablonyx
187b93275d k (#4525) 2025-04-14 22:29:47 +00:00
Weves
ca2aeac2cc Fix black 2025-04-14 15:53:09 -07:00
ThomaciousD
f7543c6285 Fix #3764: Dynamically handle default branch in GitLab connector 2025-04-14 15:52:10 -07:00
pablonyx
1430a18d44 cohere validation logic update (#4523) 2025-04-14 21:49:22 +00:00
rkuo-danswer
7c4487585d rollback properly on exception (#4073)
* rollback properly on exception

* rollback on exception

* don't continue if we can't set the search path

* cleaner handling via context manager

---------

Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>
2025-04-14 21:48:35 +00:00
pablonyx
e572ce95e7 Shore up multi tenant tests (#4484)
* update

* fix

* finalize`

* remove unnecessary prints

* fix

* k
2025-04-14 18:34:57 +00:00
evan-danswer
68c6c1f4f8 refactor to use stricter typing (#4513)
* refactor to use stricter typing

* older version of ruff
2025-04-14 17:23:07 +00:00
Chris Weaver
a5edc8aa0f Fix default log level (#4501)
* Fix default log level

* fix
2025-04-14 16:40:11 +00:00
evan-danswer
a377f6ffb6 Unify document deduping (#4520)
* minor cleanup

* cleanup doc deduping and add unit tests
2025-04-14 16:33:00 +00:00
Weves
72ce2f75cc Add env var to docker compose file 2025-04-13 23:14:06 -07:00
joachim-danswer
2683207a24 Expanded basic search (#4517)
* initial working version

* ranking profile

* modification for keyword/instruction retrieval

* mypy fixes

* EL comments

* added env var (True for now)

* flipped default to False

* mypy & final EL/CW comments + import issue
2025-04-13 23:13:01 -07:00
Chris Weaver
e3aab8e85e Improve index attempt display (#4511) 2025-04-13 15:57:47 -07:00
pablonyx
65fd8b90a8 add image indexing tests (#4477)
* address file path

* k

* update

* update

* nit- fix typing

* k

* should path

* in a good state

* k

* k

* clean up file

* update

* update

* k

* k

* k
2025-04-11 22:16:37 +00:00
1932 changed files with 175280 additions and 80589 deletions

.github/CODEOWNERS (vendored, 2 changes)

@@ -1 +1,3 @@
* @onyx-dot-app/onyx-core-team
# Helm charts Owners
/helm/ @justin-tahara


@@ -25,12 +25,26 @@ inputs:
tags:
description: 'Image tags'
required: true
no-cache:
description: 'Read from cache'
required: false
default: 'false'
cache-from:
description: 'Cache sources'
required: false
cache-to:
description: 'Cache destinations'
required: false
outputs:
description: 'Output destinations'
required: false
provenance:
description: 'Generate provenance attestation'
required: false
default: 'false'
build-args:
description: 'Build arguments'
required: false
retry-wait-time:
description: 'Time to wait before attempt 2 in seconds'
required: false
@@ -55,8 +69,12 @@ runs:
push: ${{ inputs.push }}
load: ${{ inputs.load }}
tags: ${{ inputs.tags }}
no-cache: ${{ inputs.no-cache }}
cache-from: ${{ inputs.cache-from }}
cache-to: ${{ inputs.cache-to }}
outputs: ${{ inputs.outputs }}
provenance: ${{ inputs.provenance }}
build-args: ${{ inputs.build-args }}
- name: Wait before attempt 2
if: steps.buildx1.outcome != 'success'
@@ -77,8 +95,12 @@ runs:
push: ${{ inputs.push }}
load: ${{ inputs.load }}
tags: ${{ inputs.tags }}
no-cache: ${{ inputs.no-cache }}
cache-from: ${{ inputs.cache-from }}
cache-to: ${{ inputs.cache-to }}
outputs: ${{ inputs.outputs }}
provenance: ${{ inputs.provenance }}
build-args: ${{ inputs.build-args }}
- name: Wait before attempt 3
if: steps.buildx1.outcome != 'success' && steps.buildx2.outcome != 'success'
@@ -99,8 +121,12 @@ runs:
push: ${{ inputs.push }}
load: ${{ inputs.load }}
tags: ${{ inputs.tags }}
no-cache: ${{ inputs.no-cache }}
cache-from: ${{ inputs.cache-from }}
cache-to: ${{ inputs.cache-to }}
outputs: ${{ inputs.outputs }}
provenance: ${{ inputs.provenance }}
build-args: ${{ inputs.build-args }}
- name: Report failure
if: steps.buildx1.outcome != 'success' && steps.buildx2.outcome != 'success' && steps.buildx3.outcome != 'success'
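
The hunk above threads the new cache, provenance, and build-args inputs through all three retry attempts of this composite action. As a minimal sketch of a caller, mirroring how the integration-test workflow further down invoked it (the bucket, region, and prefix values here are placeholders):

    - name: Build Backend Docker image
      uses: ./.github/actions/custom-build-and-push
      with:
        context: ./backend
        file: ./backend/Dockerfile
        tags: onyxdotapp/onyx-backend:test
        push: false
        load: true
        # S3-backed BuildKit cache; mode=max also stores intermediate layers
        cache-from: type=s3,prefix=cache/my-org/my-repo/backend/,region=us-east-1,bucket=my-cache-bucket
        cache-to: type=s3,prefix=cache/my-org/my-repo/backend/,region=us-east-1,bucket=my-cache-bucket,mode=max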


@@ -0,0 +1,50 @@
name: "Prepare Build (OpenAPI generation)"
description: "Sets up Python with uv, installs deps, generates OpenAPI schema and Python client, uploads artifact"
runs:
using: "composite"
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup uv
uses: astral-sh/setup-uv@v3
- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Install Python dependencies with uv
shell: bash
run: |
uv pip install --system \
-r backend/requirements/default.txt \
-r backend/requirements/dev.txt
- name: Generate OpenAPI schema
shell: bash
working-directory: backend
env:
PYTHONPATH: "."
run: |
python scripts/onyx_openapi_schema.py --filename generated/openapi.json
- name: Generate OpenAPI Python client
shell: bash
run: |
docker run --rm \
-v "${{ github.workspace }}/backend/generated:/local" \
openapitools/openapi-generator-cli generate \
-i /local/openapi.json \
-g python \
-o /local/onyx_openapi_client \
--package-name onyx_openapi_client \
--skip-validate-spec \
--openapi-normalizer "SIMPLIFY_ONEOF_ANYOF=true,SET_OAS3_NULLABLE=true"
- name: Upload OpenAPI artifacts
uses: actions/upload-artifact@v4
with:
name: openapi-artifacts
path: backend/generated/
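
The new prepare-build action centralizes schema and client generation, so downstream jobs only need the uploaded artifact. A hedged sketch of the producer/consumer wiring (job names are illustrative; the artifact name and path match the action above):

    jobs:
      prepare-build:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4       # needed so the local action is available
          - uses: ./.github/actions/prepare-build
      build-integration-image:
        needs: prepare-build
        runs-on: ubuntu-latest
        steps:
          - uses: actions/download-artifact@v4
            with:
              name: openapi-artifacts
              path: backend/generated/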


@@ -6,9 +6,6 @@
[Describe the tests you ran to verify your changes]
## Backporting (check the box to trigger backport action)
## Additional Options
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.
- [ ] This PR should be backported (make sure to check that the backport attempt succeeds)
- [ ] [Optional] Override Linear Check


@@ -0,0 +1,24 @@
name: Check Lazy Imports
on:
merge_group:
pull_request:
branches:
- main
- 'release/**'
jobs:
check-lazy-imports:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Check lazy imports
run: python3 backend/scripts/check_lazy_imports.py
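
The workflow delegates entirely to backend/scripts/check_lazy_imports.py, which is not shown in this diff. Purely as an illustration of what a check of this shape tends to do (the module list and pattern below are invented, not the script's actual logic):

    # Hypothetical sketch only: fail if heavy ML modules are imported at top level.
    pattern='^(import|from) (torch|transformers|vertexai)'
    if grep -rnE "$pattern" backend/onyx --include='*.py'; then
      echo 'Heavy module imported eagerly; use a lazy import instead.' >&2
      exit 1
    fi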


@@ -7,18 +7,57 @@ on:
env:
REGISTRY_IMAGE: ${{ contains(github.ref_name, 'cloud') && 'onyxdotapp/onyx-backend-cloud' || 'onyxdotapp/onyx-backend' }}
LATEST_TAG: ${{ contains(github.ref_name, 'latest') }}
DEPLOYMENT: ${{ contains(github.ref_name, 'cloud') && 'cloud' || 'standalone' }}
# tag nightly builds with "edge"
EDGE_TAG: ${{ startsWith(github.ref_name, 'nightly-latest') }}
jobs:
build-and-push:
# TODO: investigate a matrix build like the web container
# See https://runs-on.com/runners/linux/
runs-on: [runs-on, runner=8cpu-linux-x64, "run-id=${{ github.run_id }}"]
runs-on:
- runs-on
- runner=${{ matrix.platform == 'linux/amd64' && '8cpu-linux-x64' || '8cpu-linux-arm64' }}
- run-id=${{ github.run_id }}
- tag=platform-${{ matrix.platform }}
strategy:
fail-fast: false
matrix:
platform:
- linux/amd64
- linux/arm64
steps:
- name: Prepare
run: |
platform=${{ matrix.platform }}
echo "PLATFORM_PAIR=${platform//\//-}" >> $GITHUB_ENV
- name: Check if stable release version
id: check_version
run: |
if [[ "${{ github.ref_name }}" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]] && [[ "${{ github.ref_name }}" != *"cloud"* ]]; then
echo "is_stable=true" >> $GITHUB_OUTPUT
else
echo "is_stable=false" >> $GITHUB_OUTPUT
fi
- name: Checkout code
uses: actions/checkout@v4
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY_IMAGE }}
flavor: |
latest=false
tags: |
type=raw,value=${{ github.ref_name }}
type=raw,value=${{ steps.check_version.outputs.is_stable == 'true' && 'latest' || '' }}
type=raw,value=${{ env.EDGE_TAG == 'true' && 'edge' || '' }}
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
@@ -34,30 +73,114 @@ jobs:
sudo apt-get install -y build-essential
- name: Backend Image Docker Build and Push
uses: docker/build-push-action@v5
id: build
uses: docker/build-push-action@v6
with:
context: ./backend
file: ./backend/Dockerfile
platforms: linux/amd64,linux/arm64
platforms: ${{ matrix.platform }}
push: true
tags: |
${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}
${{ env.LATEST_TAG == 'true' && format('{0}:latest', env.REGISTRY_IMAGE) || '' }}
build-args: |
ONYX_VERSION=${{ github.ref_name }}
labels: ${{ steps.meta.outputs.labels }}
outputs: type=image,name=${{ env.REGISTRY_IMAGE }},push-by-digest=true,name-canonical=true,push=true
cache-from: type=s3,prefix=cache/${{ github.repository }}/${{ env.DEPLOYMENT }}/backend-${{ env.PLATFORM_PAIR }}/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }}
cache-to: type=s3,prefix=cache/${{ github.repository }}/${{ env.DEPLOYMENT }}/backend-${{ env.PLATFORM_PAIR }}/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }},mode=max
- name: Export digest
run: |
mkdir -p /tmp/digests
digest="${{ steps.build.outputs.digest }}"
touch "/tmp/digests/${digest#sha256:}"
- name: Upload digest
uses: actions/upload-artifact@v4
with:
name: backend-digests-${{ env.PLATFORM_PAIR }}-${{ github.run_id }}
path: /tmp/digests/*
if-no-files-found: error
retention-days: 1
merge:
runs-on: ubuntu-latest
needs:
- build-and-push
steps:
# Needed for trivyignore
- name: Checkout
uses: actions/checkout@v4
- name: Check if stable release version
id: check_version
run: |
if [[ "${{ github.ref_name }}" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]] && [[ "${{ github.ref_name }}" != *"cloud"* ]]; then
echo "is_stable=true" >> $GITHUB_OUTPUT
else
echo "is_stable=false" >> $GITHUB_OUTPUT
fi
- name: Download digests
uses: actions/download-artifact@v4
with:
path: /tmp/digests
pattern: backend-digests-*-${{ github.run_id }}
merge-multiple: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY_IMAGE }}
flavor: |
latest=false
tags: |
type=raw,value=${{ github.ref_name }}
type=raw,value=${{ steps.check_version.outputs.is_stable == 'true' && 'latest' || '' }}
type=raw,value=${{ env.EDGE_TAG == 'true' && 'edge' || '' }}
- name: Login to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Create manifest list and push
working-directory: /tmp/digests
run: |
docker buildx imagetools create $(jq -cr '.tags | map("-t " + .) | join(" ")' <<< "$DOCKER_METADATA_OUTPUT_JSON") \
$(printf '${{ env.REGISTRY_IMAGE }}@sha256:%s ' *)
- name: Inspect image
run: |
docker buildx imagetools inspect ${{ env.REGISTRY_IMAGE }}:${{ steps.meta.outputs.version }}
# trivy has their own rate limiting issues causing this action to flake
# we worked around it by hardcoding to different db repos in env
# can re-enable when they figure it out
# https://github.com/aquasecurity/trivy/discussions/7538
# https://github.com/aquasecurity/trivy-action/issues/389
# Security: Using pinned digest (0.65.0@sha256:a22415a38938a56c379387a8163fcb0ce38b10ace73e593475d3658d578b2436)
# Security: No Docker socket mount needed for remote registry scanning
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
env:
TRIVY_DB_REPOSITORY: "public.ecr.aws/aquasecurity/trivy-db:2"
TRIVY_JAVA_DB_REPOSITORY: "public.ecr.aws/aquasecurity/trivy-java-db:1"
uses: nick-fields/retry@v3
with:
# To run locally: trivy image --severity HIGH,CRITICAL onyxdotapp/onyx-backend
image-ref: docker.io/${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}
severity: "CRITICAL,HIGH"
trivyignores: ./backend/.trivyignore
timeout_minutes: 30
max_attempts: 3
retry_wait_seconds: 10
command: |
docker run --rm -v $HOME/.cache/trivy:/root/.cache/trivy \
-v ${{ github.workspace }}/backend/.trivyignore:/tmp/.trivyignore:ro \
-e TRIVY_DB_REPOSITORY="public.ecr.aws/aquasecurity/trivy-db:2" \
-e TRIVY_JAVA_DB_REPOSITORY="public.ecr.aws/aquasecurity/trivy-java-db:1" \
-e TRIVY_USERNAME="${{ secrets.DOCKER_USERNAME }}" \
-e TRIVY_PASSWORD="${{ secrets.DOCKER_TOKEN }}" \
aquasec/trivy@sha256:a22415a38938a56c379387a8163fcb0ce38b10ace73e593475d3658d578b2436 \
image \
--skip-version-check \
--timeout 20m \
--severity CRITICAL,HIGH \
--ignorefile /tmp/.trivyignore \
docker.io/${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}
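
In the split build above, each platform job pushes by digest only, and the merge job reassembles the multi-arch manifest from the digest filenames. The jq expression expands the metadata action's tag list into repeated -t flags. An illustration with made-up values:

    DOCKER_METADATA_OUTPUT_JSON='{"tags":["onyxdotapp/onyx-backend:v1.2.3","onyxdotapp/onyx-backend:latest"]}'
    jq -cr '.tags | map("-t " + .) | join(" ")' <<< "$DOCKER_METADATA_OUTPUT_JSON"
    # -> -t onyxdotapp/onyx-backend:v1.2.3 -t onyxdotapp/onyx-backend:latest
    # With two digest files present, the full command expands to roughly:
    # docker buildx imagetools create \
    #   -t onyxdotapp/onyx-backend:v1.2.3 -t onyxdotapp/onyx-backend:latest \
    #   onyxdotapp/onyx-backend@sha256:aaaa... onyxdotapp/onyx-backend@sha256:bbbb...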


@@ -4,12 +4,12 @@ name: Build and Push Cloud Web Image on Tag
on:
push:
tags:
- "*"
- "*cloud*"
env:
REGISTRY_IMAGE: onyxdotapp/onyx-web-server-cloud
LATEST_TAG: ${{ contains(github.ref_name, 'latest') }}
DEPLOYMENT: cloud
jobs:
build:
runs-on:
@@ -38,9 +38,10 @@ jobs:
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY_IMAGE }}
flavor: |
latest=false
tags: |
type=raw,value=${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}
type=raw,value=${{ env.LATEST_TAG == 'true' && format('{0}:latest', env.REGISTRY_IMAGE) || '' }}
type=raw,value=${{ github.ref_name }}
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
@@ -53,7 +54,7 @@ jobs:
- name: Build and push by digest
id: build
uses: docker/build-push-action@v5
uses: docker/build-push-action@v6
with:
context: ./web
file: ./web/Dockerfile
@@ -70,10 +71,12 @@ jobs:
NEXT_PUBLIC_FORGOT_PASSWORD_ENABLED=true
NEXT_PUBLIC_INCLUDE_ERROR_POPUP_SUPPORT_LINK=true
NODE_OPTIONS=--max-old-space-size=8192
# needed due to weird interactions with the builds for different platforms
no-cache: true
labels: ${{ steps.meta.outputs.labels }}
outputs: type=image,name=${{ env.REGISTRY_IMAGE }},push-by-digest=true,name-canonical=true,push=true
cache-from: type=s3,prefix=cache/${{ github.repository }}/${{ env.DEPLOYMENT }}/cloudweb-${{ env.PLATFORM_PAIR }}/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }}
cache-to: type=s3,prefix=cache/${{ github.repository }}/${{ env.DEPLOYMENT }}/cloudweb-${{ env.PLATFORM_PAIR }}/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }},mode=max
# no-cache needed due to weird interactions with the builds for different platforms
# NOTE(rkuo): this may not be true any more with the proper cache prefixing by architecture - currently testing with it off
- name: Export digest
run: |
@@ -84,7 +87,7 @@ jobs:
- name: Upload digest
uses: actions/upload-artifact@v4
with:
name: digests-${{ env.PLATFORM_PAIR }}
name: cloudweb-digests-${{ env.PLATFORM_PAIR }}-${{ github.run_id }}
path: /tmp/digests/*
if-no-files-found: error
retention-days: 1
@@ -98,7 +101,7 @@ jobs:
uses: actions/download-artifact@v4
with:
path: /tmp/digests
pattern: digests-*
pattern: cloudweb-digests-*-${{ github.run_id }}
merge-multiple: true
- name: Set up Docker Buildx
@@ -109,6 +112,10 @@ jobs:
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY_IMAGE }}
flavor: |
latest=false
tags: |
type=raw,value=${{ github.ref_name }}
- name: Login to Docker Hub
uses: docker/login-action@v3
@@ -132,10 +139,20 @@ jobs:
# https://github.com/aquasecurity/trivy/discussions/7538
# https://github.com/aquasecurity/trivy-action/issues/389
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
env:
TRIVY_DB_REPOSITORY: "public.ecr.aws/aquasecurity/trivy-db:2"
TRIVY_JAVA_DB_REPOSITORY: "public.ecr.aws/aquasecurity/trivy-java-db:1"
uses: nick-fields/retry@v3
with:
image-ref: docker.io/${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}
severity: "CRITICAL,HIGH"
timeout_minutes: 30
max_attempts: 3
retry_wait_seconds: 10
command: |
docker run --rm -v $HOME/.cache/trivy:/root/.cache/trivy \
-e TRIVY_DB_REPOSITORY="public.ecr.aws/aquasecurity/trivy-db:2" \
-e TRIVY_JAVA_DB_REPOSITORY="public.ecr.aws/aquasecurity/trivy-java-db:1" \
-e TRIVY_USERNAME="${{ secrets.DOCKER_USERNAME }}" \
-e TRIVY_PASSWORD="${{ secrets.DOCKER_TOKEN }}" \
aquasec/trivy@sha256:a22415a38938a56c379387a8163fcb0ce38b10ace73e593475d3658d578b2436 \
image \
--skip-version-check \
--timeout 20m \
--severity CRITICAL,HIGH \
docker.io/${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}


@@ -7,10 +7,13 @@ on:
env:
REGISTRY_IMAGE: ${{ contains(github.ref_name, 'cloud') && 'onyxdotapp/onyx-model-server-cloud' || 'onyxdotapp/onyx-model-server' }}
LATEST_TAG: ${{ contains(github.ref_name, 'latest') }}
DOCKER_BUILDKIT: 1
BUILDKIT_PROGRESS: plain
DEPLOYMENT: ${{ contains(github.ref_name, 'cloud') && 'cloud' || 'standalone' }}
# tag nightly builds with "edge"
EDGE_TAG: ${{ startsWith(github.ref_name, 'nightly-latest') }}
jobs:
# Bypassing this for now as the idea of not building is glitching
@@ -51,6 +54,8 @@ jobs:
if: needs.check_model_server_changes.outputs.changed == 'true'
runs-on:
[runs-on, runner=8cpu-linux-x64, "run-id=${{ github.run_id }}-amd64"]
env:
PLATFORM_PAIR: linux-amd64
steps:
- name: Checkout code
uses: actions/checkout@v4
@@ -75,7 +80,7 @@ jobs:
password: ${{ secrets.DOCKER_TOKEN }}
- name: Build and Push AMD64
uses: docker/build-push-action@v5
uses: docker/build-push-action@v6
with:
context: ./backend
file: ./backend/Dockerfile.model_server
@@ -83,15 +88,20 @@ jobs:
push: true
tags: ${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}-amd64
build-args: |
DANSWER_VERSION=${{ github.ref_name }}
ONYX_VERSION=${{ github.ref_name }}
outputs: type=registry
provenance: false
cache-from: type=s3,prefix=cache/${{ github.repository }}/${{ env.DEPLOYMENT }}/model-server-${{ env.PLATFORM_PAIR }}/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }}
cache-to: type=s3,prefix=cache/${{ github.repository }}/${{ env.DEPLOYMENT }}/model-server-${{ env.PLATFORM_PAIR }}/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }},mode=max
# no-cache: true
build-arm64:
needs: [check_model_server_changes]
if: needs.check_model_server_changes.outputs.changed == 'true'
runs-on:
[runs-on, runner=8cpu-linux-x64, "run-id=${{ github.run_id }}-arm64"]
[runs-on, runner=8cpu-linux-arm64, "run-id=${{ github.run_id }}-arm64"]
env:
PLATFORM_PAIR: linux-arm64
steps:
- name: Checkout code
uses: actions/checkout@v4
@@ -116,7 +126,7 @@ jobs:
password: ${{ secrets.DOCKER_TOKEN }}
- name: Build and Push ARM64
uses: docker/build-push-action@v5
uses: docker/build-push-action@v6
with:
context: ./backend
file: ./backend/Dockerfile.model_server
@@ -124,15 +134,26 @@ jobs:
push: true
tags: ${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}-arm64
build-args: |
DANSWER_VERSION=${{ github.ref_name }}
ONYX_VERSION=${{ github.ref_name }}
outputs: type=registry
provenance: false
cache-from: type=s3,prefix=cache/${{ github.repository }}/${{ env.DEPLOYMENT }}/model-server-${{ env.PLATFORM_PAIR }}/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }}
cache-to: type=s3,prefix=cache/${{ github.repository }}/${{ env.DEPLOYMENT }}/model-server-${{ env.PLATFORM_PAIR }}/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }},mode=max
merge-and-scan:
needs: [build-amd64, build-arm64, check_model_server_changes]
if: needs.check_model_server_changes.outputs.changed == 'true'
runs-on: ubuntu-latest
steps:
- name: Check if stable release version
id: check_version
run: |
if [[ "${{ github.ref_name }}" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]] && [[ "${{ github.ref_name }}" != *"cloud"* ]]; then
echo "is_stable=true" >> $GITHUB_OUTPUT
else
echo "is_stable=false" >> $GITHUB_OUTPUT
fi
- name: Login to Docker Hub
uses: docker/login-action@v3
with:
@@ -145,18 +166,32 @@ jobs:
docker buildx imagetools create -t ${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }} \
${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}-amd64 \
${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}-arm64
if [[ "${{ env.LATEST_TAG }}" == "true" ]]; then
if [[ "${{ steps.check_version.outputs.is_stable }}" == "true" ]]; then
docker buildx imagetools create -t ${{ env.REGISTRY_IMAGE }}:latest \
${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}-amd64 \
${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}-arm64
fi
if [[ "${{ env.EDGE_TAG }}" == "true" ]]; then
docker buildx imagetools create -t ${{ env.REGISTRY_IMAGE }}:edge \
${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}-amd64 \
${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}-arm64
fi
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
env:
TRIVY_DB_REPOSITORY: "public.ecr.aws/aquasecurity/trivy-db:2"
TRIVY_JAVA_DB_REPOSITORY: "public.ecr.aws/aquasecurity/trivy-java-db:1"
uses: nick-fields/retry@v3
with:
image-ref: docker.io/${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}
severity: "CRITICAL,HIGH"
timeout: "10m"
timeout_minutes: 30
max_attempts: 3
retry_wait_seconds: 10
command: |
docker run --rm -v $HOME/.cache/trivy:/root/.cache/trivy \
-e TRIVY_DB_REPOSITORY="public.ecr.aws/aquasecurity/trivy-db:2" \
-e TRIVY_JAVA_DB_REPOSITORY="public.ecr.aws/aquasecurity/trivy-java-db:1" \
-e TRIVY_USERNAME="${{ secrets.DOCKER_USERNAME }}" \
-e TRIVY_PASSWORD="${{ secrets.DOCKER_TOKEN }}" \
aquasec/trivy@sha256:a22415a38938a56c379387a8163fcb0ce38b10ace73e593475d3658d578b2436 \
image \
--skip-version-check \
--timeout 20m \
--severity CRITICAL,HIGH \
docker.io/${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}
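
The is_stable and EDGE_TAG checks recur in each job above: only clean vX.Y.Z tags without a cloud marker earn the :latest alias, and nightly builds earn :edge. A small demonstration of the same tests (tag values invented):

    is_stable() {
      if [[ "$1" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]] && [[ "$1" != *"cloud"* ]]; then
        echo true
      else
        echo false
      fi
    }
    is_stable v1.2.3          # true  -> the tag is also aliased to :latest
    is_stable v1.2.3-beta.1   # false: the pre-release suffix fails the regex
    is_stable v1.2.3-cloud    # false: cloud tags never receive :latest
    # A tag starting with "nightly-latest" fails is_stable but sets EDGE_TAG,
    # so it is aliased to :edge instead.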


@@ -7,10 +7,29 @@ on:
env:
REGISTRY_IMAGE: onyxdotapp/onyx-web-server
LATEST_TAG: ${{ contains(github.ref_name, 'latest') }}
# tag nightly builds with "edge"
EDGE_TAG: ${{ startsWith(github.ref_name, 'nightly-latest') }}
DEPLOYMENT: standalone
jobs:
precheck:
runs-on: [runs-on, runner=2cpu-linux-x64, "run-id=${{ github.run_id }}"]
outputs:
should-run: ${{ steps.set-output.outputs.should-run }}
steps:
- name: Check if tag contains "cloud"
id: set-output
run: |
if [[ "${{ github.ref_name }}" == *cloud* ]]; then
echo "should-run=false" >> "$GITHUB_OUTPUT"
else
echo "should-run=true" >> "$GITHUB_OUTPUT"
fi
build:
needs: precheck
if: needs.precheck.outputs.should-run == 'true'
runs-on:
- runs-on
- runner=${{ matrix.platform == 'linux/amd64' && '8cpu-linux-x64' || '8cpu-linux-arm64' }}
@@ -29,6 +48,15 @@ jobs:
platform=${{ matrix.platform }}
echo "PLATFORM_PAIR=${platform//\//-}" >> $GITHUB_ENV
- name: Check if stable release version
id: check_version
run: |
if [[ "${{ github.ref_name }}" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
echo "is_stable=true" >> $GITHUB_OUTPUT
else
echo "is_stable=false" >> $GITHUB_OUTPUT
fi
- name: Checkout
uses: actions/checkout@v4
@@ -37,9 +65,12 @@ jobs:
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY_IMAGE }}
flavor: |
latest=false
tags: |
type=raw,value=${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}
type=raw,value=${{ env.LATEST_TAG == 'true' && format('{0}:latest', env.REGISTRY_IMAGE) || '' }}
type=raw,value=${{ github.ref_name }}
type=raw,value=${{ steps.check_version.outputs.is_stable == 'true' && 'latest' || '' }}
type=raw,value=${{ env.EDGE_TAG == 'true' && 'edge' || '' }}
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
@@ -52,7 +83,7 @@ jobs:
- name: Build and push by digest
id: build
uses: docker/build-push-action@v5
uses: docker/build-push-action@v6
with:
context: ./web
file: ./web/Dockerfile
@@ -62,11 +93,13 @@ jobs:
ONYX_VERSION=${{ github.ref_name }}
NODE_OPTIONS=--max-old-space-size=8192
# needed due to weird interactions with the builds for different platforms
no-cache: true
labels: ${{ steps.meta.outputs.labels }}
outputs: type=image,name=${{ env.REGISTRY_IMAGE }},push-by-digest=true,name-canonical=true,push=true
cache-from: type=s3,prefix=cache/${{ github.repository }}/${{ env.DEPLOYMENT }}/web-${{ env.PLATFORM_PAIR }}/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }}
cache-to: type=s3,prefix=cache/${{ github.repository }}/${{ env.DEPLOYMENT }}/web-${{ env.PLATFORM_PAIR }}/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }},mode=max
# no-cache needed due to weird interactions with the builds for different platforms
# NOTE(rkuo): this may not be true any more with the proper cache prefixing by architecture - currently testing with it off
- name: Export digest
run: |
mkdir -p /tmp/digests
@@ -76,21 +109,31 @@ jobs:
- name: Upload digest
uses: actions/upload-artifact@v4
with:
name: digests-${{ env.PLATFORM_PAIR }}
name: web-digests-${{ env.PLATFORM_PAIR }}-${{ github.run_id }}
path: /tmp/digests/*
if-no-files-found: error
retention-days: 1
merge:
runs-on: ubuntu-latest
needs:
- build
if: needs.precheck.outputs.should-run == 'true'
runs-on: ubuntu-latest
steps:
- name: Check if stable release version
id: check_version
run: |
if [[ "${{ github.ref_name }}" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]] && [[ "${{ github.ref_name }}" != *"cloud"* ]]; then
echo "is_stable=true" >> $GITHUB_OUTPUT
else
echo "is_stable=false" >> $GITHUB_OUTPUT
fi
- name: Download digests
uses: actions/download-artifact@v4
with:
path: /tmp/digests
pattern: digests-*
pattern: web-digests-*-${{ github.run_id }}
merge-multiple: true
- name: Set up Docker Buildx
@@ -101,6 +144,12 @@ jobs:
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY_IMAGE }}
flavor: |
latest=false
tags: |
type=raw,value=${{ github.ref_name }}
type=raw,value=${{ steps.check_version.outputs.is_stable == 'true' && 'latest' || '' }}
type=raw,value=${{ env.EDGE_TAG == 'true' && 'edge' || '' }}
- name: Login to Docker Hub
uses: docker/login-action@v3
@@ -124,10 +173,20 @@ jobs:
# https://github.com/aquasecurity/trivy/discussions/7538
# https://github.com/aquasecurity/trivy-action/issues/389
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
env:
TRIVY_DB_REPOSITORY: "public.ecr.aws/aquasecurity/trivy-db:2"
TRIVY_JAVA_DB_REPOSITORY: "public.ecr.aws/aquasecurity/trivy-java-db:1"
uses: nick-fields/retry@v3
with:
image-ref: docker.io/${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}
severity: "CRITICAL,HIGH"
timeout_minutes: 30
max_attempts: 3
retry_wait_seconds: 10
command: |
docker run --rm -v $HOME/.cache/trivy:/root/.cache/trivy \
-e TRIVY_DB_REPOSITORY="public.ecr.aws/aquasecurity/trivy-db:2" \
-e TRIVY_JAVA_DB_REPOSITORY="public.ecr.aws/aquasecurity/trivy-java-db:1" \
-e TRIVY_USERNAME="${{ secrets.DOCKER_USERNAME }}" \
-e TRIVY_PASSWORD="${{ secrets.DOCKER_TOKEN }}" \
aquasec/trivy@sha256:a22415a38938a56c379387a8163fcb0ce38b10ace73e593475d3658d578b2436 \
image \
--skip-version-check \
--timeout 20m \
--severity CRITICAL,HIGH \
docker.io/${{ env.REGISTRY_IMAGE }}:${{ github.ref_name }}


@@ -35,3 +35,7 @@ jobs:
- name: Pull, Tag and Push API Server Image
run: |
docker buildx imagetools create -t onyxdotapp/onyx-backend:latest onyxdotapp/onyx-backend:${{ github.event.inputs.version }}
- name: Pull, Tag and Push Model Server Image
run: |
docker buildx imagetools create -t onyxdotapp/onyx-model-server:latest onyxdotapp/onyx-model-server:${{ github.event.inputs.version }}


@@ -0,0 +1,52 @@
name: Release Onyx Helm Charts
on:
push:
branches:
- main
permissions: write-all
jobs:
release:
permissions:
contents: write
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Install Helm CLI
uses: azure/setup-helm@v4
with:
version: v3.12.1
- name: Add required Helm repositories
run: |
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo add onyx-vespa https://onyx-dot-app.github.io/vespa-helm-charts
helm repo add cloudnative-pg https://cloudnative-pg.github.io/charts
helm repo add ot-container-kit https://ot-container-kit.github.io/helm-charts
helm repo add minio https://charts.min.io/
helm repo update
- name: Build chart dependencies
run: |
set -euo pipefail
for chart_dir in deployment/helm/charts/*; do
if [ -f "$chart_dir/Chart.yaml" ]; then
echo "Building dependencies for $chart_dir"
helm dependency build "$chart_dir"
fi
done
- name: Publish Helm charts to gh-pages
uses: stefanprodan/helm-gh-pages@v1.7.0
with:
token: ${{ secrets.GITHUB_TOKEN }}
charts_dir: deployment/helm/charts
branch: gh-pages
commit_username: ${{ github.actor }}
commit_email: ${{ github.actor }}@users.noreply.github.com
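
helm dependency build resolves each chart's dependencies block against the repositories added in the earlier step. A hypothetical Chart.yaml fragment of the shape this loop expects (names, version ranges, and conditions are placeholders, not the actual chart contents):

    dependencies:
      - name: vespa
        version: "0.x.x"        # placeholder range
        repository: https://onyx-dot-app.github.io/vespa-helm-charts
        condition: vespa.enabled
      - name: minio
        version: "5.x.x"        # placeholder range
        repository: https://charts.min.io/
        condition: minio.enabled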


@@ -1,171 +0,0 @@
# This workflow is intended to be manually triggered via the GitHub Action tab.
# Given a hotfix branch, it will attempt to open a PR to all release branches and
# by default auto merge them
name: Hotfix release branches
on:
workflow_dispatch:
inputs:
hotfix_commit:
description: "Hotfix commit hash"
required: true
hotfix_suffix:
description: "Hotfix branch suffix (e.g. hotfix/v0.8-{suffix})"
required: true
release_branch_pattern:
description: "Release branch pattern (regex)"
required: true
default: "release/.*"
auto_merge:
description: "Automatically merge the hotfix PRs"
required: true
type: choice
default: "true"
options:
- true
- false
jobs:
hotfix_release_branches:
permissions: write-all
# See https://runs-on.com/runners/linux/
# use a lower powered instance since this just does i/o to docker hub
runs-on: [runs-on, runner=2cpu-linux-x64, "run-id=${{ github.run_id }}"]
steps:
# needs RKUO_DEPLOY_KEY for write access to merge PR's
- name: Checkout Repository
uses: actions/checkout@v4
with:
ssh-key: "${{ secrets.RKUO_DEPLOY_KEY }}"
fetch-depth: 0
- name: Set up Git user
run: |
git config user.name "Richard Kuo [bot]"
git config user.email "rkuo[bot]@onyx.app"
- name: Fetch All Branches
run: |
git fetch --all --prune
- name: Verify Hotfix Commit Exists
run: |
git rev-parse --verify "${{ github.event.inputs.hotfix_commit }}" || { echo "Commit not found: ${{ github.event.inputs.hotfix_commit }}"; exit 1; }
- name: Get Release Branches
id: get_release_branches
run: |
BRANCHES=$(git branch -r | grep -E "${{ github.event.inputs.release_branch_pattern }}" | sed 's|origin/||' | tr -d ' ')
if [ -z "$BRANCHES" ]; then
echo "No release branches found matching pattern '${{ github.event.inputs.release_branch_pattern }}'."
exit 1
fi
echo "Found release branches:"
echo "$BRANCHES"
# Join the branches into a single line separated by commas
BRANCHES_JOINED=$(echo "$BRANCHES" | tr '\n' ',' | sed 's/,$//')
# Set the branches as an output
echo "branches=$BRANCHES_JOINED" >> $GITHUB_OUTPUT
# notes on all the vagaries of wiring up automated PR's
# https://github.com/peter-evans/create-pull-request/blob/main/docs/concepts-guidelines.md#triggering-further-workflow-runs
# we must use a custom token for GH_TOKEN to trigger the subsequent PR checks
- name: Create and Merge Pull Requests to Matching Release Branches
env:
HOTFIX_COMMIT: ${{ github.event.inputs.hotfix_commit }}
HOTFIX_SUFFIX: ${{ github.event.inputs.hotfix_suffix }}
AUTO_MERGE: ${{ github.event.inputs.auto_merge }}
GH_TOKEN: ${{ secrets.RKUO_PERSONAL_ACCESS_TOKEN }}
run: |
# Get the branches from the previous step
BRANCHES="${{ steps.get_release_branches.outputs.branches }}"
# Convert BRANCHES to an array
IFS=$',' read -ra BRANCH_ARRAY <<< "$BRANCHES"
# Loop through each release branch and create and merge a PR
for RELEASE_BRANCH in "${BRANCH_ARRAY[@]}"; do
echo "Processing $RELEASE_BRANCH..."
# Parse out the release version by removing "release/" from the branch name
RELEASE_VERSION=${RELEASE_BRANCH#release/}
echo "Release version parsed: $RELEASE_VERSION"
HOTFIX_BRANCH="hotfix/${RELEASE_VERSION}-${HOTFIX_SUFFIX}"
echo "Creating PR from $HOTFIX_BRANCH to $RELEASE_BRANCH"
# Checkout the release branch
echo "Checking out $RELEASE_BRANCH"
git checkout "$RELEASE_BRANCH"
# Create the new hotfix branch
if git rev-parse --verify "$HOTFIX_BRANCH" >/dev/null 2>&1; then
echo "Hotfix branch $HOTFIX_BRANCH already exists. Skipping branch creation."
else
echo "Branching $RELEASE_BRANCH to $HOTFIX_BRANCH"
git checkout -b "$HOTFIX_BRANCH"
fi
# Check if the hotfix commit is a merge commit
if git rev-list --merges -n 1 "$HOTFIX_COMMIT" >/dev/null 2>&1; then
# -m 1 uses the target branch as the base (which is what we want)
echo "Hotfix commit $HOTFIX_COMMIT is a merge commit, using -m 1 for cherry-pick"
CHERRY_PICK_CMD="git cherry-pick -m 1 $HOTFIX_COMMIT"
else
CHERRY_PICK_CMD="git cherry-pick $HOTFIX_COMMIT"
fi
# Perform the cherry-pick
echo "Executing: $CHERRY_PICK_CMD"
eval "$CHERRY_PICK_CMD"
if [ $? -ne 0 ]; then
echo "Cherry-pick failed for $HOTFIX_COMMIT on $HOTFIX_BRANCH. Aborting..."
git cherry-pick --abort
continue
fi
# Push the hotfix branch to the remote
echo "Pushing $HOTFIX_BRANCH..."
git push origin "$HOTFIX_BRANCH"
echo "Hotfix branch $HOTFIX_BRANCH created and pushed."
# Check if PR already exists
EXISTING_PR=$(gh pr list --head "$HOTFIX_BRANCH" --base "$RELEASE_BRANCH" --state open --json number --jq '.[0].number')
if [ -n "$EXISTING_PR" ]; then
echo "An open PR already exists: #$EXISTING_PR. Skipping..."
continue
fi
# Create a new PR and capture the output
PR_OUTPUT=$(gh pr create --title "Merge $HOTFIX_BRANCH into $RELEASE_BRANCH" \
--body "Automated PR to merge \`$HOTFIX_BRANCH\` into \`$RELEASE_BRANCH\`." \
--head "$HOTFIX_BRANCH" --base "$RELEASE_BRANCH")
# Extract the URL from the output
PR_URL=$(echo "$PR_OUTPUT" | grep -Eo 'https://github.com/[^ ]+')
echo "Pull request created: $PR_URL"
# Extract PR number from URL
PR_NUMBER=$(basename "$PR_URL")
echo "Pull request created: $PR_NUMBER"
if [ "$AUTO_MERGE" == "true" ]; then
echo "Attempting to merge pull request #$PR_NUMBER"
# Attempt to merge the PR
gh pr merge "$PR_NUMBER" --merge --auto --delete-branch
if [ $? -eq 0 ]; then
echo "Pull request #$PR_NUMBER merged successfully."
else
# Optionally, handle the error or continue
echo "Failed to merge pull request #$PR_NUMBER."
fi
fi
done
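
This deleted workflow special-cases merge commits because cherry-picking one requires a mainline: -m 1 replays the merge's changes relative to its first parent, i.e. the branch it was merged into. The same decision, sketched with a more direct parent check (rev-parse on SHA^2 succeeds only for merges):

    if git rev-parse -q --verify "$SHA^2" >/dev/null; then
      git cherry-pick -m 1 "$SHA"   # merge commit: diff against first parent
    else
      git cherry-pick "$SHA"        # ordinary single-parent commit
    fi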


@@ -1,124 +0,0 @@
name: Backport on Merge
# Note this workflow does not trigger the builds, be sure to manually tag the branches to trigger the builds
on:
pull_request:
types: [closed] # Later we check for merge so only PRs that go in can get backported
permissions:
contents: write
actions: write
jobs:
backport:
if: github.event.pull_request.merged == true
runs-on: ubuntu-latest
env:
GITHUB_TOKEN: ${{ secrets.YUHONG_GH_ACTIONS }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ssh-key: "${{ secrets.RKUO_DEPLOY_KEY }}"
fetch-depth: 0
- name: Set up Git user
run: |
git config user.name "Richard Kuo [bot]"
git config user.email "rkuo[bot]@onyx.app"
git fetch --prune
- name: Check for Backport Checkbox
id: checkbox-check
run: |
PR_BODY="${{ github.event.pull_request.body }}"
if [[ "$PR_BODY" == *"[x] This PR should be backported"* ]]; then
echo "backport=true" >> $GITHUB_OUTPUT
else
echo "backport=false" >> $GITHUB_OUTPUT
fi
- name: List and sort release branches
id: list-branches
run: |
git fetch --all --tags
BRANCHES=$(git for-each-ref --format='%(refname:short)' refs/remotes/origin/release/* | sed 's|origin/release/||' | sort -Vr)
BETA=$(echo "$BRANCHES" | head -n 1)
STABLE=$(echo "$BRANCHES" | head -n 2 | tail -n 1)
echo "beta=release/$BETA" >> $GITHUB_OUTPUT
echo "stable=release/$STABLE" >> $GITHUB_OUTPUT
# Fetch latest tags for beta and stable
LATEST_BETA_TAG=$(git tag -l "v[0-9]*.[0-9]*.[0-9]*-beta.[0-9]*" | grep -E "^v[0-9]+\.[0-9]+\.[0-9]+-beta\.[0-9]+$" | grep -v -- "-cloud" | sort -Vr | head -n 1)
LATEST_STABLE_TAG=$(git tag -l "v[0-9]*.[0-9]*.[0-9]*" | grep -E "^v[0-9]+\.[0-9]+\.[0-9]+$" | sort -Vr | head -n 1)
# Handle case where no beta tags exist
if [[ -z "$LATEST_BETA_TAG" ]]; then
NEW_BETA_TAG="v1.0.0-beta.1"
else
NEW_BETA_TAG=$(echo $LATEST_BETA_TAG | awk -F '[.-]' '{print $1 "." $2 "." $3 "-beta." ($NF+1)}')
fi
# Increment latest stable tag
NEW_STABLE_TAG=$(echo $LATEST_STABLE_TAG | awk -F '.' '{print $1 "." $2 "." ($3+1)}')
echo "latest_beta_tag=$LATEST_BETA_TAG" >> $GITHUB_OUTPUT
echo "latest_stable_tag=$LATEST_STABLE_TAG" >> $GITHUB_OUTPUT
echo "new_beta_tag=$NEW_BETA_TAG" >> $GITHUB_OUTPUT
echo "new_stable_tag=$NEW_STABLE_TAG" >> $GITHUB_OUTPUT
- name: Echo branch and tag information
run: |
echo "Beta branch: ${{ steps.list-branches.outputs.beta }}"
echo "Stable branch: ${{ steps.list-branches.outputs.stable }}"
echo "Latest beta tag: ${{ steps.list-branches.outputs.latest_beta_tag }}"
echo "Latest stable tag: ${{ steps.list-branches.outputs.latest_stable_tag }}"
echo "New beta tag: ${{ steps.list-branches.outputs.new_beta_tag }}"
echo "New stable tag: ${{ steps.list-branches.outputs.new_stable_tag }}"
- name: Trigger Backport
if: steps.checkbox-check.outputs.backport == 'true'
run: |
set -e
echo "Backporting to beta ${{ steps.list-branches.outputs.beta }} and stable ${{ steps.list-branches.outputs.stable }}"
# Echo the merge commit SHA
echo "Merge commit SHA: ${{ github.event.pull_request.merge_commit_sha }}"
# Fetch all history for all branches and tags
git fetch --prune
# Reset and prepare the beta branch
git checkout ${{ steps.list-branches.outputs.beta }}
echo "Last 5 commits on beta branch:"
git log -n 5 --pretty=format:"%H"
echo "" # Newline for formatting
# Cherry-pick the merge commit from the merged PR
git cherry-pick -m 1 ${{ github.event.pull_request.merge_commit_sha }} || {
echo "Cherry-pick to beta failed due to conflicts."
exit 1
}
# Create new beta branch/tag
git tag ${{ steps.list-branches.outputs.new_beta_tag }}
# Push the changes and tag to the beta branch using PAT
git push origin ${{ steps.list-branches.outputs.beta }}
git push origin ${{ steps.list-branches.outputs.new_beta_tag }}
# Reset and prepare the stable branch
git checkout ${{ steps.list-branches.outputs.stable }}
echo "Last 5 commits on stable branch:"
git log -n 5 --pretty=format:"%H"
echo "" # Newline for formatting
# Cherry-pick the merge commit from the merged PR
git cherry-pick -m 1 ${{ github.event.pull_request.merge_commit_sha }} || {
echo "Cherry-pick to stable failed due to conflicts."
exit 1
}
# Create new stable branch/tag
git tag ${{ steps.list-branches.outputs.new_stable_tag }}
# Push the changes and tag to the stable branch using PAT
git push origin ${{ steps.list-branches.outputs.stable }}
git push origin ${{ steps.list-branches.outputs.new_stable_tag }}
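
The deleted backport workflow derived the next tags with awk field splitting. The two expressions behave as follows (sample tags for illustration):

    echo v0.8.12 | awk -F '.' '{print $1 "." $2 "." ($3+1)}'
    # -> v0.8.13  (patch bump for the stable branch)
    echo v1.0.0-beta.4 | awk -F '[.-]' '{print $1 "." $2 "." $3 "-beta." ($NF+1)}'
    # -> v1.0.0-beta.5  (bump only the trailing beta counter)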


@@ -0,0 +1,110 @@
name: External Dependency Unit Tests
on:
merge_group:
pull_request:
branches: [main]
env:
# AWS
S3_AWS_ACCESS_KEY_ID: ${{ secrets.S3_AWS_ACCESS_KEY_ID }}
S3_AWS_SECRET_ACCESS_KEY: ${{ secrets.S3_AWS_SECRET_ACCESS_KEY }}
# MinIO
S3_ENDPOINT_URL: "http://localhost:9004"
# Confluence
CONFLUENCE_TEST_SPACE_URL: ${{ vars.CONFLUENCE_TEST_SPACE_URL }}
CONFLUENCE_TEST_SPACE: ${{ vars.CONFLUENCE_TEST_SPACE }}
CONFLUENCE_TEST_PAGE_ID: ${{ secrets.CONFLUENCE_TEST_PAGE_ID }}
CONFLUENCE_USER_NAME: ${{ vars.CONFLUENCE_USER_NAME }}
CONFLUENCE_ACCESS_TOKEN: ${{ secrets.CONFLUENCE_ACCESS_TOKEN }}
CONFLUENCE_ACCESS_TOKEN_SCOPED: ${{ secrets.CONFLUENCE_ACCESS_TOKEN_SCOPED }}
# LLMs
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
jobs:
discover-test-dirs:
runs-on: ubuntu-latest
outputs:
test-dirs: ${{ steps.set-matrix.outputs.test-dirs }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Discover test directories
id: set-matrix
run: |
# Find all subdirectories in backend/tests/external_dependency_unit
dirs=$(find backend/tests/external_dependency_unit -mindepth 1 -maxdepth 1 -type d -exec basename {} \; | sort | jq -R -s -c 'split("\n")[:-1]')
echo "test-dirs=$dirs" >> $GITHUB_OUTPUT
external-dependency-unit-tests:
needs: discover-test-dirs
# Use larger runner with more resources for Vespa
runs-on: [runs-on, runner=16cpu-linux-x64, "run-id=${{ github.run_id }}"]
strategy:
fail-fast: false
matrix:
test-dir: ${{ fromJson(needs.discover-test-dirs.outputs.test-dirs) }}
env:
PYTHONPATH: ./backend
MODEL_SERVER_HOST: "disabled"
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
cache: "pip"
cache-dependency-path: |
backend/requirements/default.txt
backend/requirements/dev.txt
- name: Install Dependencies
run: |
python -m pip install --upgrade pip
pip install --retries 5 --timeout 30 -r backend/requirements/default.txt
pip install --retries 5 --timeout 30 -r backend/requirements/dev.txt
playwright install chromium
playwright install-deps chromium
- name: Set up Standard Dependencies
run: |
cd deployment/docker_compose
docker compose -f docker-compose.yml -f docker-compose.dev.yml up -d minio relational_db cache index
- name: Wait for services
run: |
echo "Waiting for services to be ready..."
sleep 30
# Wait for Vespa specifically
echo "Waiting for Vespa to be ready..."
timeout 300 bash -c 'until curl -f -s http://localhost:8081/ApplicationStatus > /dev/null 2>&1; do echo "Vespa not ready, waiting..."; sleep 10; done' || echo "Vespa timeout - continuing anyway"
echo "Services should be ready now"
- name: Run migrations
run: |
cd backend
# Run migrations to head
alembic upgrade head
alembic heads --verbose
- name: Run Tests for ${{ matrix.test-dir }}
shell: script -q -e -c "bash --noprofile --norc -eo pipefail {0}"
run: |
py.test \
--durations=8 \
-o junit_family=xunit2 \
-xv \
--ff \
backend/tests/external_dependency_unit/${{ matrix.test-dir }}
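
The discovery job turns a directory listing into a JSON array that strategy.matrix can consume: jq -R -s slurps the raw lines into one string, split("\n")[:-1] drops the trailing empty element, and -c emits compact JSON. For example (directory names invented):

    printf 'confluence\nfile\nvespa\n' | jq -R -s -c 'split("\n")[:-1]'
    # -> ["confluence","file","vespa"]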


@@ -37,6 +37,11 @@ jobs:
echo "changed=true" >> "$GITHUB_OUTPUT"
fi
# uncomment to force run chart-testing
# - name: Force run chart-testing (list-changed)
# id: list-changed
# run: echo "changed=true" >> $GITHUB_OUTPUT
# lint all charts if any changes were detected
- name: Run chart-testing (lint)
if: steps.list-changed.outputs.changed == 'true'
@@ -48,9 +53,176 @@ jobs:
if: steps.list-changed.outputs.changed == 'true'
uses: helm/kind-action@v1.12.0
- name: Run chart-testing (install)
- name: Pre-install cluster status check
if: steps.list-changed.outputs.changed == 'true'
run: ct install --all --helm-extra-set-args="--set=nginx.enabled=false" --debug --config ct.yaml
run: |
echo "=== Pre-install Cluster Status ==="
kubectl get nodes -o wide
kubectl get pods --all-namespaces
kubectl get storageclass
- name: Add Helm repositories and update
if: steps.list-changed.outputs.changed == 'true'
run: |
echo "=== Adding Helm repositories ==="
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo add vespa https://onyx-dot-app.github.io/vespa-helm-charts
helm repo add cloudnative-pg https://cloudnative-pg.github.io/charts
helm repo add ot-container-kit https://ot-container-kit.github.io/helm-charts
helm repo add minio https://charts.min.io/
helm repo update
- name: Install Redis operator
if: steps.list-changed.outputs.changed == 'true'
shell: bash
run: |
echo "=== Installing redis-operator CRDs ==="
helm upgrade --install redis-operator ot-container-kit/redis-operator \
--namespace redis-operator --create-namespace --wait --timeout 300s
- name: Pre-pull required images
if: steps.list-changed.outputs.changed == 'true'
run: |
echo "=== Pre-pulling required images to avoid timeout ==="
KIND_CLUSTER=$(kubectl config current-context | sed 's/kind-//')
echo "Kind cluster: $KIND_CLUSTER"
IMAGES=(
"ghcr.io/cloudnative-pg/cloudnative-pg:1.27.0"
"quay.io/opstree/redis:v7.0.15"
"docker.io/onyxdotapp/onyx-web-server:latest"
)
for image in "${IMAGES[@]}"; do
echo "Pre-pulling $image"
if docker pull "$image"; then
kind load docker-image "$image" --name "$KIND_CLUSTER" || echo "Failed to load $image into kind"
else
echo "Failed to pull $image"
fi
done
echo "=== Images loaded into Kind cluster ==="
docker exec "$KIND_CLUSTER"-control-plane crictl images | grep -E "(cloudnative-pg|redis|onyx)" || echo "Some images may still be loading..."
- name: Validate chart dependencies
if: steps.list-changed.outputs.changed == 'true'
run: |
echo "=== Validating chart dependencies ==="
cd deployment/helm/charts/onyx
helm dependency update
helm lint .
- name: Run chart-testing (install) with enhanced monitoring
timeout-minutes: 25
if: steps.list-changed.outputs.changed == 'true'
run: |
echo "=== Starting chart installation with monitoring ==="
# Function to monitor cluster state
monitor_cluster() {
while true; do
echo "=== Cluster Status Check at $(date) ==="
# Only show non-running pods to reduce noise
NON_RUNNING_PODS=$(kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded --no-headers 2>/dev/null | wc -l)
if [ "$NON_RUNNING_PODS" -gt 0 ]; then
echo "Non-running pods:"
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded
else
echo "All pods running successfully"
fi
# Only show recent events if there are issues
RECENT_EVENTS=$(kubectl get events --sort-by=.lastTimestamp --all-namespaces --field-selector=type!=Normal 2>/dev/null | tail -5)
if [ -n "$RECENT_EVENTS" ]; then
echo "Recent warnings/errors:"
echo "$RECENT_EVENTS"
fi
sleep 60
done
}
# Start monitoring in background
monitor_cluster &
MONITOR_PID=$!
# Set up cleanup
cleanup() {
echo "=== Cleaning up monitoring process ==="
kill $MONITOR_PID 2>/dev/null || true
echo "=== Final cluster state ==="
kubectl get pods --all-namespaces
kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -20
}
# Trap cleanup on exit
trap cleanup EXIT
# Run the actual installation with detailed logging
echo "=== Starting ct install ==="
set +e
ct install --all \
--helm-extra-set-args="\
--set=nginx.enabled=false \
--set=minio.enabled=false \
--set=vespa.enabled=false \
--set=slackbot.enabled=false \
--set=postgresql.enabled=true \
--set=postgresql.nameOverride=cloudnative-pg \
--set=postgresql.cluster.storage.storageClass=standard \
--set=redis.enabled=true \
--set=redis.storageSpec.volumeClaimTemplate.spec.storageClassName=standard \
--set=webserver.replicaCount=1 \
--set=api.replicaCount=0 \
--set=inferenceCapability.replicaCount=0 \
--set=indexCapability.replicaCount=0 \
--set=celery_beat.replicaCount=0 \
--set=celery_worker_heavy.replicaCount=0 \
--set=celery_worker_docfetching.replicaCount=0 \
--set=celery_worker_docprocessing.replicaCount=0 \
--set=celery_worker_light.replicaCount=0 \
--set=celery_worker_monitoring.replicaCount=0 \
--set=celery_worker_primary.replicaCount=0 \
--set=celery_worker_user_file_processing.replicaCount=0 \
--set=celery_worker_user_files_indexing.replicaCount=0" \
--helm-extra-args="--timeout 900s --debug" \
--debug --config ct.yaml
CT_EXIT=$?
set -e
if [[ $CT_EXIT -ne 0 ]]; then
echo "ct install failed with exit code $CT_EXIT"
exit $CT_EXIT
else
echo "=== Installation completed successfully ==="
fi
kubectl get pods --all-namespaces
- name: Post-install verification
if: steps.list-changed.outputs.changed == 'true'
run: |
echo "=== Post-install verification ==="
kubectl get pods --all-namespaces
kubectl get services --all-namespaces
# Only show issues if they exist
kubectl describe pods --all-namespaces | grep -A 5 -B 2 "Failed\|Error\|Warning" || echo "No pod issues found"
- name: Cleanup on failure
if: failure() && steps.list-changed.outputs.changed == 'true'
run: |
echo "=== Cleanup on failure ==="
echo "=== Final cluster state ==="
kubectl get pods --all-namespaces
kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -10
echo "=== Pod descriptions for debugging ==="
kubectl describe pods --all-namespaces | grep -A 10 -B 3 "Failed\|Error\|Warning\|Pending" || echo "No problematic pods found"
echo "=== Recent logs for debugging ==="
kubectl logs --all-namespaces --tail=50 | grep -i "error\|timeout\|failed\|pull" || echo "No error logs found"
echo "=== Helm releases ==="
helm list --all-namespaces
# the following would install only changed charts, but we only have one chart so
# don't worry about that for now
# run: ct install --target-branch ${{ github.event.repository.default_branch }}
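
The install step above wraps ct install in a background watcher plus an EXIT trap, so the monitor is always reaped and a final cluster snapshot prints even on failure. The pattern in isolation, as a minimal sketch (the sleep stands in for the long-running install):

    monitor() {
      while true; do
        # surface only pods that are not yet healthy
        kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
        sleep 60
      done
    }
    monitor & MONITOR_PID=$!
    trap 'kill "$MONITOR_PID" 2>/dev/null || true; kubectl get pods -A' EXIT
    sleep 900   # long-running install goes here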


@@ -11,142 +11,211 @@ on:
- "release/**"
env:
# Private Registry Configuration
PRIVATE_REGISTRY: experimental-registry.blacksmith.sh:5000
PRIVATE_REGISTRY_USERNAME: ${{ secrets.PRIVATE_REGISTRY_USERNAME }}
PRIVATE_REGISTRY_PASSWORD: ${{ secrets.PRIVATE_REGISTRY_PASSWORD }}
# Test Environment Variables
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
CONFLUENCE_TEST_SPACE_URL: ${{ secrets.CONFLUENCE_TEST_SPACE_URL }}
CONFLUENCE_USER_NAME: ${{ secrets.CONFLUENCE_USER_NAME }}
CONFLUENCE_TEST_SPACE_URL: ${{ vars.CONFLUENCE_TEST_SPACE_URL }}
CONFLUENCE_USER_NAME: ${{ vars.CONFLUENCE_USER_NAME }}
CONFLUENCE_ACCESS_TOKEN: ${{ secrets.CONFLUENCE_ACCESS_TOKEN }}
CONFLUENCE_ACCESS_TOKEN_SCOPED: ${{ secrets.CONFLUENCE_ACCESS_TOKEN_SCOPED }}
JIRA_BASE_URL: ${{ secrets.JIRA_BASE_URL }}
JIRA_USER_EMAIL: ${{ secrets.JIRA_USER_EMAIL }}
JIRA_API_TOKEN: ${{ secrets.JIRA_API_TOKEN }}
JIRA_API_TOKEN_SCOPED: ${{ secrets.JIRA_API_TOKEN_SCOPED }}
PERM_SYNC_SHAREPOINT_CLIENT_ID: ${{ secrets.PERM_SYNC_SHAREPOINT_CLIENT_ID }}
PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ${{ secrets.PERM_SYNC_SHAREPOINT_PRIVATE_KEY }}
PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ${{ secrets.PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD }}
PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ${{ secrets.PERM_SYNC_SHAREPOINT_DIRECTORY_ID }}
jobs:
integration-tests:
# See https://runs-on.com/runners/linux/
runs-on: [runs-on, runner=32cpu-linux-x64, "run-id=${{ github.run_id }}"]
discover-test-dirs:
runs-on: blacksmith-2vcpu-ubuntu-2404-arm
outputs:
test-dirs: ${{ steps.set-matrix.outputs.test-dirs }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Discover test directories
id: set-matrix
run: |
# Find all leaf-level directories in both test directories
tests_dirs=$(find backend/tests/integration/tests -mindepth 1 -maxdepth 1 -type d ! -name "__pycache__" -exec basename {} \; | sort)
connector_dirs=$(find backend/tests/integration/connector_job_tests -mindepth 1 -maxdepth 1 -type d ! -name "__pycache__" -exec basename {} \; | sort)
# Create JSON array with directory info
all_dirs=""
for dir in $tests_dirs; do
all_dirs="$all_dirs{\"path\":\"tests/$dir\",\"name\":\"tests-$dir\"},"
done
for dir in $connector_dirs; do
all_dirs="$all_dirs{\"path\":\"connector_job_tests/$dir\",\"name\":\"connector-$dir\"},"
done
# Remove trailing comma and wrap in array
all_dirs="[${all_dirs%,}]"
echo "test-dirs=$all_dirs" >> $GITHUB_OUTPUT
prepare-build:
runs-on: blacksmith-2vcpu-ubuntu-2404-arm
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Prepare build
uses: ./.github/actions/prepare-build
build-backend-image:
runs-on: blacksmith-16vcpu-ubuntu-2404-arm
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Login to Private Registry
uses: docker/login-action@v3
with:
registry: ${{ env.PRIVATE_REGISTRY }}
username: ${{ env.PRIVATE_REGISTRY_USERNAME }}
password: ${{ env.PRIVATE_REGISTRY_PASSWORD }}
- name: Set up Docker Buildx
uses: useblacksmith/setup-docker-builder@v1
- name: Build and push Backend Docker image
uses: useblacksmith/build-push-action@v2
with:
context: ./backend
file: ./backend/Dockerfile
platforms: linux/arm64
tags: ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-backend:test-${{ github.run_id }}
push: true
outputs: type=registry
no-cache: ${{ vars.DOCKER_NO_CACHE == 'true' }}
build-model-server-image:
runs-on: blacksmith-16vcpu-ubuntu-2404-arm
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Login to Private Registry
uses: docker/login-action@v3
with:
registry: ${{ env.PRIVATE_REGISTRY }}
username: ${{ env.PRIVATE_REGISTRY_USERNAME }}
password: ${{ env.PRIVATE_REGISTRY_PASSWORD }}
- name: Set up Docker Buildx
uses: useblacksmith/setup-docker-builder@v1
- name: Build and push Model Server Docker image
uses: useblacksmith/build-push-action@v2
with:
context: ./backend
file: ./backend/Dockerfile.model_server
platforms: linux/arm64
tags: ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-model-server:test-${{ github.run_id }}
push: true
outputs: type=registry
provenance: false
no-cache: ${{ vars.DOCKER_NO_CACHE == 'true' }}
build-integration-image:
needs: prepare-build
runs-on: blacksmith-16vcpu-ubuntu-2404-arm
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Login to Private Registry
uses: docker/login-action@v3
with:
registry: ${{ env.PRIVATE_REGISTRY }}
username: ${{ env.PRIVATE_REGISTRY_USERNAME }}
password: ${{ env.PRIVATE_REGISTRY_PASSWORD }}
- name: Download OpenAPI artifacts
uses: actions/download-artifact@v4
with:
name: openapi-artifacts
path: backend/generated/
- name: Set up Docker Buildx
uses: useblacksmith/setup-docker-builder@v1
- name: Build and push integration test Docker image
uses: useblacksmith/build-push-action@v2
with:
context: ./backend
file: ./backend/tests/integration/Dockerfile
platforms: linux/arm64
tags: ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-integration:test-${{ github.run_id }}
push: true
outputs: type=registry
no-cache: ${{ vars.DOCKER_NO_CACHE == 'true' }}
integration-tests:
needs:
[
discover-test-dirs,
build-backend-image,
build-model-server-image,
build-integration-image,
]
runs-on: blacksmith-8vcpu-ubuntu-2404-arm
strategy:
fail-fast: false
matrix:
test-dir: ${{ fromJson(needs.discover-test-dirs.outputs.test-dirs) }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Login to Private Registry
uses: docker/login-action@v3
with:
registry: ${{ env.PRIVATE_REGISTRY }}
username: ${{ env.PRIVATE_REGISTRY_USERNAME }}
password: ${{ env.PRIVATE_REGISTRY_PASSWORD }}
# needed for pulling Vespa, Redis, Postgres, and Minio images
# otherwise, we hit the "Unauthenticated users" limit
# https://docs.docker.com/docker-hub/usage/
- name: Login to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
# tag every docker image with "test" so that we can spin up the correct set
# of images during testing
# We don't need to build the Web Docker image since it's not yet used
# in the integration tests. We have a separate action to verify that it builds
# successfully.
- name: Pull Web Docker image
- name: Pull Docker images
run: |
docker pull onyxdotapp/onyx-web-server:latest
docker tag onyxdotapp/onyx-web-server:latest onyxdotapp/onyx-web-server:test
# Pull all images from registry in parallel
echo "Pulling Docker images in parallel..."
# Pull images from private registry
(docker pull --platform linux/arm64 ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-backend:test-${{ github.run_id }}) &
(docker pull --platform linux/arm64 ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-model-server:test-${{ github.run_id }}) &
(docker pull --platform linux/arm64 ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-integration:test-${{ github.run_id }}) &
# we use the runs-on cache for docker builds
# in conjunction with runs-on runners, it has better speed and unlimited caching
# https://runs-on.com/caching/s3-cache-for-github-actions/
# https://runs-on.com/caching/docker/
# https://github.com/moby/buildkit#s3-cache-experimental
# Wait for all background jobs to complete
wait
echo "All Docker images pulled successfully"
# images are built and run locally for testing purposes. Not pushed.
- name: Build Backend Docker image
uses: ./.github/actions/custom-build-and-push
with:
context: ./backend
file: ./backend/Dockerfile
platforms: linux/amd64
tags: onyxdotapp/onyx-backend:test
push: false
load: true
cache-from: type=s3,prefix=cache/${{ github.repository }}/integration-tests/backend/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }}
cache-to: type=s3,prefix=cache/${{ github.repository }}/integration-tests/backend/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }},mode=max
- name: Build Model Server Docker image
uses: ./.github/actions/custom-build-and-push
with:
context: ./backend
file: ./backend/Dockerfile.model_server
platforms: linux/amd64
tags: onyxdotapp/onyx-model-server:test
push: false
load: true
cache-from: type=s3,prefix=cache/${{ github.repository }}/integration-tests/model-server/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }}
cache-to: type=s3,prefix=cache/${{ github.repository }}/integration-tests/model-server/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }},mode=max
- name: Build integration test Docker image
uses: ./.github/actions/custom-build-and-push
with:
context: ./backend
file: ./backend/tests/integration/Dockerfile
platforms: linux/amd64
tags: onyxdotapp/onyx-integration:test
push: false
load: true
cache-from: type=s3,prefix=cache/${{ github.repository }}/integration-tests/integration/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }}
cache-to: type=s3,prefix=cache/${{ github.repository }}/integration-tests/integration/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }},mode=max
# Start containers for multi-tenant tests
- name: Start Docker containers for multi-tenant tests
run: |
cd deployment/docker_compose
ENABLE_PAID_ENTERPRISE_EDITION_FEATURES=true \
MULTI_TENANT=true \
AUTH_TYPE=cloud \
REQUIRE_EMAIL_VERIFICATION=false \
DISABLE_TELEMETRY=true \
IMAGE_TAG=test \
DEV_MODE=true \
docker compose -f docker-compose.multitenant-dev.yml -p onyx-stack up -d
id: start_docker_multi_tenant
# In practice, `cloud` Auth type would require OAUTH credentials to be set.
- name: Run Multi-Tenant Integration Tests
run: |
echo "Waiting for 3 minutes to ensure API server is ready..."
sleep 180
echo "Running integration tests..."
docker run --rm --network onyx-stack_default \
--name test-runner \
-e POSTGRES_HOST=relational_db \
-e POSTGRES_USER=postgres \
-e POSTGRES_PASSWORD=password \
-e POSTGRES_DB=postgres \
-e POSTGRES_USE_NULL_POOL=true \
-e VESPA_HOST=index \
-e REDIS_HOST=cache \
-e API_SERVER_HOST=api_server \
-e OPENAI_API_KEY=${OPENAI_API_KEY} \
-e SLACK_BOT_TOKEN=${SLACK_BOT_TOKEN} \
-e TEST_WEB_HOSTNAME=test-runner \
-e AUTH_TYPE=cloud \
-e MULTI_TENANT=true \
-e REQUIRE_EMAIL_VERIFICATION=false \
-e DISABLE_TELEMETRY=true \
-e IMAGE_TAG=test \
-e DEV_MODE=true \
onyxdotapp/onyx-integration:test \
/app/tests/integration/multitenant_tests
continue-on-error: true
id: run_multitenant_tests
- name: Check multi-tenant test results
run: |
if [ ${{ steps.run_multitenant_tests.outcome }} == 'failure' ]; then
echo "Multi-tenant integration tests failed. Exiting with error."
exit 1
else
echo "All multi-tenant integration tests passed successfully."
fi
- name: Stop multi-tenant Docker containers
run: |
cd deployment/docker_compose
docker compose -f docker-compose.multitenant-dev.yml -p onyx-stack down -v
# Re-tag to remove registry prefix for docker-compose
docker tag ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-backend:test-${{ github.run_id }} onyxdotapp/onyx-backend:test
docker tag ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-model-server:test-${{ github.run_id }} onyxdotapp/onyx-model-server:test
docker tag ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-integration:test-${{ github.run_id }} onyxdotapp/onyx-integration:test
# NOTE: Use pre-ping/null pool to reduce flakiness due to dropped connections
# NOTE: don't need web server for integration tests
- name: Start Docker containers
run: |
cd deployment/docker_compose
@@ -158,14 +227,24 @@ jobs:
DISABLE_TELEMETRY=true \
IMAGE_TAG=test \
INTEGRATION_TESTS_MODE=true \
docker compose -f docker-compose.dev.yml -p onyx-stack up -d
CHECK_TTL_MANAGEMENT_TASK_FREQUENCY_IN_HOURS=0.001 \
docker compose -f docker-compose.yml -f docker-compose.dev.yml up \
relational_db \
index \
cache \
minio \
api_server \
inference_model_server \
indexing_model_server \
background \
-d
id: start_docker
- name: Wait for service to be ready
run: |
echo "Starting wait-for-service script..."
docker logs -f onyx-stack-api_server-1 &
docker logs -f onyx-api_server-1 &
start_time=$(date +%s)
timeout=300 # 5 minutes in seconds
@@ -201,43 +280,46 @@ jobs:
docker compose -f docker-compose.mock-it-services.yml \
-p mock-it-services-stack up -d
# NOTE: Use pre-ping/null to reduce flakiness due to dropped connections
- name: Run Standard Integration Tests
run: |
echo "Running integration tests..."
docker run --rm --network onyx-stack_default \
--name test-runner \
-e POSTGRES_HOST=relational_db \
-e POSTGRES_USER=postgres \
-e POSTGRES_PASSWORD=password \
-e POSTGRES_DB=postgres \
-e POSTGRES_POOL_PRE_PING=true \
-e POSTGRES_USE_NULL_POOL=true \
-e VESPA_HOST=index \
-e REDIS_HOST=cache \
-e API_SERVER_HOST=api_server \
-e OPENAI_API_KEY=${OPENAI_API_KEY} \
-e SLACK_BOT_TOKEN=${SLACK_BOT_TOKEN} \
-e CONFLUENCE_TEST_SPACE_URL=${CONFLUENCE_TEST_SPACE_URL} \
-e CONFLUENCE_USER_NAME=${CONFLUENCE_USER_NAME} \
-e CONFLUENCE_ACCESS_TOKEN=${CONFLUENCE_ACCESS_TOKEN} \
-e TEST_WEB_HOSTNAME=test-runner \
-e MOCK_CONNECTOR_SERVER_HOST=mock_connector_server \
-e MOCK_CONNECTOR_SERVER_PORT=8001 \
onyxdotapp/onyx-integration:test \
/app/tests/integration/tests \
/app/tests/integration/connector_job_tests
continue-on-error: true
id: run_tests
- name: Check test results
run: |
if [ ${{ steps.run_tests.outcome }} == 'failure' ]; then
echo "Integration tests failed. Exiting with error."
exit 1
else
echo "All integration tests passed successfully."
fi
- name: Run Integration Tests for ${{ matrix.test-dir.name }}
uses: nick-fields/retry@v3
with:
timeout_minutes: 20
max_attempts: 3
retry_wait_seconds: 10
command: |
echo "Running integration tests for ${{ matrix.test-dir.path }}..."
docker run --rm --network onyx_default \
--name test-runner \
-e POSTGRES_HOST=relational_db \
-e POSTGRES_USER=postgres \
-e POSTGRES_PASSWORD=password \
-e POSTGRES_DB=postgres \
-e DB_READONLY_USER=db_readonly_user \
-e DB_READONLY_PASSWORD=password \
-e POSTGRES_POOL_PRE_PING=true \
-e POSTGRES_USE_NULL_POOL=true \
-e VESPA_HOST=index \
-e REDIS_HOST=cache \
-e API_SERVER_HOST=api_server \
-e OPENAI_API_KEY=${OPENAI_API_KEY} \
-e SLACK_BOT_TOKEN=${SLACK_BOT_TOKEN} \
-e CONFLUENCE_TEST_SPACE_URL=${CONFLUENCE_TEST_SPACE_URL} \
-e CONFLUENCE_USER_NAME=${CONFLUENCE_USER_NAME} \
-e CONFLUENCE_ACCESS_TOKEN=${CONFLUENCE_ACCESS_TOKEN} \
-e CONFLUENCE_ACCESS_TOKEN_SCOPED=${CONFLUENCE_ACCESS_TOKEN_SCOPED} \
-e JIRA_BASE_URL=${JIRA_BASE_URL} \
-e JIRA_USER_EMAIL=${JIRA_USER_EMAIL} \
-e JIRA_API_TOKEN=${JIRA_API_TOKEN} \
-e JIRA_API_TOKEN_SCOPED=${JIRA_API_TOKEN_SCOPED} \
-e PERM_SYNC_SHAREPOINT_CLIENT_ID=${PERM_SYNC_SHAREPOINT_CLIENT_ID} \
-e PERM_SYNC_SHAREPOINT_PRIVATE_KEY="${PERM_SYNC_SHAREPOINT_PRIVATE_KEY}" \
-e PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD=${PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD} \
-e PERM_SYNC_SHAREPOINT_DIRECTORY_ID=${PERM_SYNC_SHAREPOINT_DIRECTORY_ID} \
-e TEST_WEB_HOSTNAME=test-runner \
-e MOCK_CONNECTOR_SERVER_HOST=mock_connector_server \
-e MOCK_CONNECTOR_SERVER_PORT=8001 \
onyxdotapp/onyx-integration:test \
/app/tests/integration/${{ matrix.test-dir.path }}
# ------------------------------------------------------------
# Always gather logs BEFORE "down":
@@ -245,19 +327,19 @@ jobs:
if: always()
run: |
cd deployment/docker_compose
docker compose -f docker-compose.dev.yml -p onyx-stack logs --no-color api_server > $GITHUB_WORKSPACE/api_server.log || true
docker compose logs --no-color api_server > $GITHUB_WORKSPACE/api_server.log || true
- name: Dump all-container logs (optional)
if: always()
run: |
cd deployment/docker_compose
docker compose -f docker-compose.dev.yml -p onyx-stack logs --no-color > $GITHUB_WORKSPACE/docker-compose.log || true
docker compose logs --no-color > $GITHUB_WORKSPACE/docker-compose.log || true
- name: Upload logs
if: always()
uses: actions/upload-artifact@v4
with:
name: docker-all-logs
name: docker-all-logs-${{ matrix.test-dir.name }}
path: ${{ github.workspace }}/docker-compose.log
# ------------------------------------------------------------
@@ -265,4 +347,158 @@ jobs:
if: always()
run: |
cd deployment/docker_compose
docker compose -f docker-compose.dev.yml -p onyx-stack down -v
docker compose down -v
multitenant-tests:
needs:
[
build-backend-image,
build-model-server-image,
build-integration-image,
]
runs-on: blacksmith-8vcpu-ubuntu-2404-arm
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Login to Private Registry
uses: docker/login-action@v3
with:
registry: ${{ env.PRIVATE_REGISTRY }}
username: ${{ env.PRIVATE_REGISTRY_USERNAME }}
password: ${{ env.PRIVATE_REGISTRY_PASSWORD }}
- name: Login to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Pull Docker images
run: |
(docker pull --platform linux/arm64 ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-backend:test-${{ github.run_id }}) &
(docker pull --platform linux/arm64 ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-model-server:test-${{ github.run_id }}) &
(docker pull --platform linux/arm64 ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-integration:test-${{ github.run_id }}) &
wait
docker tag ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-backend:test-${{ github.run_id }} onyxdotapp/onyx-backend:test
docker tag ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-model-server:test-${{ github.run_id }} onyxdotapp/onyx-model-server:test
docker tag ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-integration:test-${{ github.run_id }} onyxdotapp/onyx-integration:test
- name: Start Docker containers for multi-tenant tests
run: |
cd deployment/docker_compose
ENABLE_PAID_ENTERPRISE_EDITION_FEATURES=true \
MULTI_TENANT=true \
AUTH_TYPE=cloud \
REQUIRE_EMAIL_VERIFICATION=false \
DISABLE_TELEMETRY=true \
IMAGE_TAG=test \
DEV_MODE=true \
docker compose -f docker-compose.multitenant-dev.yml up \
relational_db \
index \
cache \
minio \
api_server \
inference_model_server \
indexing_model_server \
background \
-d
id: start_docker_multi_tenant
- name: Wait for service to be ready (multi-tenant)
run: |
echo "Starting wait-for-service script for multi-tenant..."
docker logs -f onyx-api_server-1 &
start_time=$(date +%s)
timeout=300
while true; do
current_time=$(date +%s)
elapsed_time=$((current_time - start_time))
if [ $elapsed_time -ge $timeout ]; then
echo "Timeout reached. Service did not become ready in 5 minutes."
exit 1
fi
response=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health || echo "curl_error")
if [ "$response" = "200" ]; then
echo "Service is ready!"
break
elif [ "$response" = "curl_error" ]; then
echo "Curl encountered an error; retrying..."
else
echo "Service not ready yet (HTTP $response). Retrying in 5 seconds..."
fi
sleep 5
done
echo "Finished waiting for service."
- name: Run Multi-Tenant Integration Tests
run: |
echo "Running multi-tenant integration tests..."
docker run --rm --network onyx_default \
--name test-runner \
-e POSTGRES_HOST=relational_db \
-e POSTGRES_USER=postgres \
-e POSTGRES_PASSWORD=password \
-e DB_READONLY_USER=db_readonly_user \
-e DB_READONLY_PASSWORD=password \
-e POSTGRES_DB=postgres \
-e POSTGRES_USE_NULL_POOL=true \
-e VESPA_HOST=index \
-e REDIS_HOST=cache \
-e API_SERVER_HOST=api_server \
-e OPENAI_API_KEY=${OPENAI_API_KEY} \
-e SLACK_BOT_TOKEN=${SLACK_BOT_TOKEN} \
-e TEST_WEB_HOSTNAME=test-runner \
-e AUTH_TYPE=cloud \
-e MULTI_TENANT=true \
-e SKIP_RESET=true \
-e REQUIRE_EMAIL_VERIFICATION=false \
-e DISABLE_TELEMETRY=true \
-e IMAGE_TAG=test \
-e DEV_MODE=true \
onyxdotapp/onyx-integration:test \
/app/tests/integration/multitenant_tests
- name: Dump API server logs (multi-tenant)
if: always()
run: |
cd deployment/docker_compose
docker compose -f docker-compose.multitenant-dev.yml logs --no-color api_server > $GITHUB_WORKSPACE/api_server_multitenant.log || true
- name: Dump all-container logs (multi-tenant)
if: always()
run: |
cd deployment/docker_compose
docker compose -f docker-compose.multitenant-dev.yml logs --no-color > $GITHUB_WORKSPACE/docker-compose-multitenant.log || true
- name: Upload logs (multi-tenant)
if: always()
uses: actions/upload-artifact@v4
with:
name: docker-all-logs-multitenant
path: ${{ github.workspace }}/docker-compose-multitenant.log
- name: Stop multi-tenant Docker containers
if: always()
run: |
cd deployment/docker_compose
docker compose -f docker-compose.multitenant-dev.yml down -v
required:
runs-on: blacksmith-2vcpu-ubuntu-2404-arm
needs: [integration-tests, multitenant-tests]
if: ${{ always() }}
steps:
- uses: actions/github-script@v7
with:
script: |
const needs = ${{ toJSON(needs) }};
const failed = Object.values(needs).some(n => n.result !== 'success');
if (failed) {
core.setFailed('One or more upstream jobs failed or were cancelled.');
} else {
core.notice('All required jobs succeeded.');
}

.github/workflows/pr-jest-tests.yml
@@ -0,0 +1,35 @@
name: Run Jest Tests
concurrency:
group: Run-Jest-Tests-${{ github.workflow }}-${{ github.head_ref || github.event.workflow_run.head_branch || github.run_id }}
cancel-in-progress: true
on: push
jobs:
jest-tests:
name: Jest Tests
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup node
uses: actions/setup-node@v4
with:
node-version: 22
- name: Install node dependencies
working-directory: ./web
run: npm ci
- name: Run Jest tests
working-directory: ./web
run: npm test -- --ci --coverage --maxWorkers=50%
- name: Upload coverage reports
if: always()
uses: actions/upload-artifact@v4
with:
name: jest-coverage-${{ github.run_id }}
path: ./web/coverage
retention-days: 7

.github/workflows/pr-labeler.yml
@@ -0,0 +1,38 @@
name: PR Labeler
on:
pull_request_target:
branches:
- main
types:
- opened
- reopened
- synchronize
- edited
permissions:
contents: read
pull-requests: write
jobs:
validate_pr_title:
runs-on: ubuntu-latest
steps:
- name: Check PR title for Conventional Commits
env:
PR_TITLE: ${{ github.event.pull_request.title }}
run: |
echo "PR Title: $PR_TITLE"
if [[ ! "$PR_TITLE" =~ ^(feat|fix|docs|test|ci|refactor|perf|chore|revert|build)(\(.+\))?:\ .+ ]]; then
echo "::error::❌ Your PR title does not follow the Conventional Commits format.
This check ensures that all pull requests use clear, consistent titles that help automate changelogs and improve project history.
Please update your PR title to follow the Conventional Commits style.
Here is a link to a blog post explaining why we use the Conventional Commits style for our PR titles: https://xfuture-blog.com/working-with-conventional-commits
**Here are some examples of valid PR titles:**
- feat: add user authentication
- fix(login): handle null password error
- docs(readme): update installation instructions"
exit 1
fi

@@ -5,90 +5,216 @@ concurrency:
on:
merge_group:
pull_request:
branches:
- main
- "release/**"
types: [checks_requested]
env:
# Private Registry Configuration
PRIVATE_REGISTRY: experimental-registry.blacksmith.sh:5000
PRIVATE_REGISTRY_USERNAME: ${{ secrets.PRIVATE_REGISTRY_USERNAME }}
PRIVATE_REGISTRY_PASSWORD: ${{ secrets.PRIVATE_REGISTRY_PASSWORD }}
# Test Environment Variables
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
CONFLUENCE_TEST_SPACE_URL: ${{ secrets.CONFLUENCE_TEST_SPACE_URL }}
CONFLUENCE_USER_NAME: ${{ secrets.CONFLUENCE_USER_NAME }}
CONFLUENCE_TEST_SPACE_URL: ${{ vars.CONFLUENCE_TEST_SPACE_URL }}
CONFLUENCE_USER_NAME: ${{ vars.CONFLUENCE_USER_NAME }}
CONFLUENCE_ACCESS_TOKEN: ${{ secrets.CONFLUENCE_ACCESS_TOKEN }}
CONFLUENCE_ACCESS_TOKEN_SCOPED: ${{ secrets.CONFLUENCE_ACCESS_TOKEN_SCOPED }}
JIRA_BASE_URL: ${{ secrets.JIRA_BASE_URL }}
JIRA_USER_EMAIL: ${{ secrets.JIRA_USER_EMAIL }}
JIRA_API_TOKEN: ${{ secrets.JIRA_API_TOKEN }}
JIRA_API_TOKEN_SCOPED: ${{ secrets.JIRA_API_TOKEN_SCOPED }}
PERM_SYNC_SHAREPOINT_CLIENT_ID: ${{ secrets.PERM_SYNC_SHAREPOINT_CLIENT_ID }}
PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ${{ secrets.PERM_SYNC_SHAREPOINT_PRIVATE_KEY }}
PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ${{ secrets.PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD }}
PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ${{ secrets.PERM_SYNC_SHAREPOINT_DIRECTORY_ID }}
jobs:
integration-tests-mit:
# See https://runs-on.com/runners/linux/
runs-on: [runs-on, runner=32cpu-linux-x64, "run-id=${{ github.run_id }}"]
discover-test-dirs:
runs-on: blacksmith-2vcpu-ubuntu-2404-arm
outputs:
test-dirs: ${{ steps.set-matrix.outputs.test-dirs }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Discover test directories
id: set-matrix
run: |
# Find all leaf-level directories in both test directories
tests_dirs=$(find backend/tests/integration/tests -mindepth 1 -maxdepth 1 -type d ! -name "__pycache__" -exec basename {} \; | sort)
connector_dirs=$(find backend/tests/integration/connector_job_tests -mindepth 1 -maxdepth 1 -type d ! -name "__pycache__" -exec basename {} \; | sort)
# Create JSON array with directory info
all_dirs=""
for dir in $tests_dirs; do
all_dirs="$all_dirs{\"path\":\"tests/$dir\",\"name\":\"tests-$dir\"},"
done
for dir in $connector_dirs; do
all_dirs="$all_dirs{\"path\":\"connector_job_tests/$dir\",\"name\":\"connector-$dir\"},"
done
# Remove trailing comma and wrap in array
all_dirs="[${all_dirs%,}]"
echo "test-dirs=$all_dirs" >> $GITHUB_OUTPUT
prepare-build:
runs-on: blacksmith-2vcpu-ubuntu-2404-arm
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Prepare build
uses: ./.github/actions/prepare-build
build-backend-image:
runs-on: blacksmith-16vcpu-ubuntu-2404-arm
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Login to Private Registry
uses: docker/login-action@v3
with:
registry: ${{ env.PRIVATE_REGISTRY }}
username: ${{ env.PRIVATE_REGISTRY_USERNAME }}
password: ${{ env.PRIVATE_REGISTRY_PASSWORD }}
- name: Set up Docker Buildx
uses: useblacksmith/setup-docker-builder@v1
- name: Build and push Backend Docker image
uses: useblacksmith/build-push-action@v2
with:
context: ./backend
file: ./backend/Dockerfile
platforms: linux/arm64
tags: ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-backend:test-${{ github.run_id }}
push: true
outputs: type=registry
no-cache: ${{ vars.DOCKER_NO_CACHE == 'true' }}
build-model-server-image:
runs-on: blacksmith-16vcpu-ubuntu-2404-arm
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Login to Private Registry
uses: docker/login-action@v3
with:
registry: ${{ env.PRIVATE_REGISTRY }}
username: ${{ env.PRIVATE_REGISTRY_USERNAME }}
password: ${{ env.PRIVATE_REGISTRY_PASSWORD }}
- name: Set up Docker Buildx
uses: useblacksmith/setup-docker-builder@v1
- name: Build and push Model Server Docker image
uses: useblacksmith/build-push-action@v2
with:
context: ./backend
file: ./backend/Dockerfile.model_server
platforms: linux/arm64
tags: ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-model-server:test-${{ github.run_id }}
push: true
outputs: type=registry
provenance: false
no-cache: ${{ vars.DOCKER_NO_CACHE == 'true' }}
build-integration-image:
needs: prepare-build
runs-on: blacksmith-16vcpu-ubuntu-2404-arm
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Login to Private Registry
uses: docker/login-action@v3
with:
registry: ${{ env.PRIVATE_REGISTRY }}
username: ${{ env.PRIVATE_REGISTRY_USERNAME }}
password: ${{ env.PRIVATE_REGISTRY_PASSWORD }}
- name: Download OpenAPI artifacts
uses: actions/download-artifact@v4
with:
name: openapi-artifacts
path: backend/generated/
- name: Set up Docker Buildx
uses: useblacksmith/setup-docker-builder@v1
- name: Build and push integration test Docker image
uses: useblacksmith/build-push-action@v2
with:
context: ./backend
file: ./backend/tests/integration/Dockerfile
platforms: linux/arm64
tags: ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-integration:test-${{ github.run_id }}
push: true
outputs: type=registry
no-cache: ${{ vars.DOCKER_NO_CACHE == 'true' }}
integration-tests-mit:
needs:
[
discover-test-dirs,
build-backend-image,
build-model-server-image,
build-integration-image,
]
# See https://docs.blacksmith.sh/blacksmith-runners/overview
runs-on: blacksmith-8vcpu-ubuntu-2404-arm
strategy:
fail-fast: false
matrix:
test-dir: ${{ fromJson(needs.discover-test-dirs.outputs.test-dirs) }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Login to Private Registry
uses: docker/login-action@v3
with:
registry: ${{ env.PRIVATE_REGISTRY }}
username: ${{ env.PRIVATE_REGISTRY_USERNAME }}
password: ${{ env.PRIVATE_REGISTRY_PASSWORD }}
# needed for pulling Vespa, Redis, Postgres, and Minio images
# otherwise, we hit the "Unauthenticated users" limit
# https://docs.docker.com/docker-hub/usage/
- name: Login to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
# tag every docker image with "test" so that we can spin up the correct set
# of images during testing
# We don't need to build the Web Docker image since it's not yet used
# in the integration tests. We have a separate action to verify that it builds
# successfully.
- name: Pull Web Docker image
- name: Pull Docker images
run: |
docker pull onyxdotapp/onyx-web-server:latest
docker tag onyxdotapp/onyx-web-server:latest onyxdotapp/onyx-web-server:test
# Pull all images from registry in parallel
echo "Pulling Docker images in parallel..."
# Pull images from private registry
(docker pull --platform linux/arm64 ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-backend:test-${{ github.run_id }}) &
(docker pull --platform linux/arm64 ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-model-server:test-${{ github.run_id }}) &
(docker pull --platform linux/arm64 ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-integration:test-${{ github.run_id }}) &
# we use the runs-on cache for docker builds
# in conjunction with runs-on runners, it has better speed and unlimited caching
# https://runs-on.com/caching/s3-cache-for-github-actions/
# https://runs-on.com/caching/docker/
# https://github.com/moby/buildkit#s3-cache-experimental
# Wait for all background jobs to complete
wait
echo "All Docker images pulled successfully"
# images are built and run locally for testing purposes. Not pushed.
- name: Build Backend Docker image
uses: ./.github/actions/custom-build-and-push
with:
context: ./backend
file: ./backend/Dockerfile
platforms: linux/amd64
tags: onyxdotapp/onyx-backend:test
push: false
load: true
cache-from: type=s3,prefix=cache/${{ github.repository }}/integration-tests/backend/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }}
cache-to: type=s3,prefix=cache/${{ github.repository }}/integration-tests/backend/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }},mode=max
- name: Build Model Server Docker image
uses: ./.github/actions/custom-build-and-push
with:
context: ./backend
file: ./backend/Dockerfile.model_server
platforms: linux/amd64
tags: onyxdotapp/onyx-model-server:test
push: false
load: true
cache-from: type=s3,prefix=cache/${{ github.repository }}/integration-tests/model-server/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }}
cache-to: type=s3,prefix=cache/${{ github.repository }}/integration-tests/model-server/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }},mode=max
- name: Build integration test Docker image
uses: ./.github/actions/custom-build-and-push
with:
context: ./backend
file: ./backend/tests/integration/Dockerfile
platforms: linux/amd64
tags: onyxdotapp/onyx-integration:test
push: false
load: true
cache-from: type=s3,prefix=cache/${{ github.repository }}/integration-tests/integration/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }}
cache-to: type=s3,prefix=cache/${{ github.repository }}/integration-tests/integration/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }},mode=max
# Re-tag to remove registry prefix for docker-compose
docker tag ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-backend:test-${{ github.run_id }} onyxdotapp/onyx-backend:test
docker tag ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-model-server:test-${{ github.run_id }} onyxdotapp/onyx-model-server:test
docker tag ${{ env.PRIVATE_REGISTRY }}/integration-test-onyx-integration:test-${{ github.run_id }} onyxdotapp/onyx-integration:test
# NOTE: Use pre-ping/null pool to reduce flakiness due to dropped connections
# NOTE: don't need web server for integration tests
- name: Start Docker containers
run: |
cd deployment/docker_compose
@@ -99,14 +225,23 @@ jobs:
DISABLE_TELEMETRY=true \
IMAGE_TAG=test \
INTEGRATION_TESTS_MODE=true \
docker compose -f docker-compose.dev.yml -p onyx-stack up -d
docker compose -f docker-compose.yml -f docker-compose.dev.yml up \
relational_db \
index \
cache \
minio \
api_server \
inference_model_server \
indexing_model_server \
background \
-d
id: start_docker
- name: Wait for service to be ready
run: |
echo "Starting wait-for-service script..."
docker logs -f onyx-stack-api_server-1 &
docker logs -f onyx-api_server-1 &
start_time=$(date +%s)
timeout=300 # 5 minutes in seconds
@@ -143,42 +278,46 @@ jobs:
-p mock-it-services-stack up -d
# NOTE: Use pre-ping/null to reduce flakiness due to dropped connections
- name: Run Standard Integration Tests
run: |
echo "Running integration tests..."
docker run --rm --network onyx-stack_default \
--name test-runner \
-e POSTGRES_HOST=relational_db \
-e POSTGRES_USER=postgres \
-e POSTGRES_PASSWORD=password \
-e POSTGRES_DB=postgres \
-e POSTGRES_POOL_PRE_PING=true \
-e POSTGRES_USE_NULL_POOL=true \
-e VESPA_HOST=index \
-e REDIS_HOST=cache \
-e API_SERVER_HOST=api_server \
-e OPENAI_API_KEY=${OPENAI_API_KEY} \
-e SLACK_BOT_TOKEN=${SLACK_BOT_TOKEN} \
-e CONFLUENCE_TEST_SPACE_URL=${CONFLUENCE_TEST_SPACE_URL} \
-e CONFLUENCE_USER_NAME=${CONFLUENCE_USER_NAME} \
-e CONFLUENCE_ACCESS_TOKEN=${CONFLUENCE_ACCESS_TOKEN} \
-e TEST_WEB_HOSTNAME=test-runner \
-e MOCK_CONNECTOR_SERVER_HOST=mock_connector_server \
-e MOCK_CONNECTOR_SERVER_PORT=8001 \
onyxdotapp/onyx-integration:test \
/app/tests/integration/tests \
/app/tests/integration/connector_job_tests
continue-on-error: true
id: run_tests
- name: Check test results
run: |
if [ ${{ steps.run_tests.outcome }} == 'failure' ]; then
echo "Integration tests failed. Exiting with error."
exit 1
else
echo "All integration tests passed successfully."
fi
- name: Run Integration Tests for ${{ matrix.test-dir.name }}
uses: nick-fields/retry@v3
with:
timeout_minutes: 20
max_attempts: 3
retry_wait_seconds: 10
command: |
echo "Running integration tests for ${{ matrix.test-dir.path }}..."
docker run --rm --network onyx_default \
--name test-runner \
-e POSTGRES_HOST=relational_db \
-e POSTGRES_USER=postgres \
-e POSTGRES_PASSWORD=password \
-e POSTGRES_DB=postgres \
-e DB_READONLY_USER=db_readonly_user \
-e DB_READONLY_PASSWORD=password \
-e POSTGRES_POOL_PRE_PING=true \
-e POSTGRES_USE_NULL_POOL=true \
-e VESPA_HOST=index \
-e REDIS_HOST=cache \
-e API_SERVER_HOST=api_server \
-e OPENAI_API_KEY=${OPENAI_API_KEY} \
-e SLACK_BOT_TOKEN=${SLACK_BOT_TOKEN} \
-e CONFLUENCE_TEST_SPACE_URL=${CONFLUENCE_TEST_SPACE_URL} \
-e CONFLUENCE_USER_NAME=${CONFLUENCE_USER_NAME} \
-e CONFLUENCE_ACCESS_TOKEN=${CONFLUENCE_ACCESS_TOKEN} \
-e CONFLUENCE_ACCESS_TOKEN_SCOPED=${CONFLUENCE_ACCESS_TOKEN_SCOPED} \
-e JIRA_BASE_URL=${JIRA_BASE_URL} \
-e JIRA_USER_EMAIL=${JIRA_USER_EMAIL} \
-e JIRA_API_TOKEN=${JIRA_API_TOKEN} \
-e JIRA_API_TOKEN_SCOPED=${JIRA_API_TOKEN_SCOPED} \
-e PERM_SYNC_SHAREPOINT_CLIENT_ID=${PERM_SYNC_SHAREPOINT_CLIENT_ID} \
-e PERM_SYNC_SHAREPOINT_PRIVATE_KEY="${PERM_SYNC_SHAREPOINT_PRIVATE_KEY}" \
-e PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD=${PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD} \
-e PERM_SYNC_SHAREPOINT_DIRECTORY_ID=${PERM_SYNC_SHAREPOINT_DIRECTORY_ID} \
-e TEST_WEB_HOSTNAME=test-runner \
-e MOCK_CONNECTOR_SERVER_HOST=mock_connector_server \
-e MOCK_CONNECTOR_SERVER_PORT=8001 \
onyxdotapp/onyx-integration:test \
/app/tests/integration/${{ matrix.test-dir.path }}
# ------------------------------------------------------------
# Always gather logs BEFORE "down":
@@ -186,19 +325,19 @@ jobs:
if: always()
run: |
cd deployment/docker_compose
docker compose -f docker-compose.dev.yml -p onyx-stack logs --no-color api_server > $GITHUB_WORKSPACE/api_server.log || true
docker compose logs --no-color api_server > $GITHUB_WORKSPACE/api_server.log || true
- name: Dump all-container logs (optional)
if: always()
run: |
cd deployment/docker_compose
docker compose -f docker-compose.dev.yml -p onyx-stack logs --no-color > $GITHUB_WORKSPACE/docker-compose.log || true
docker compose logs --no-color > $GITHUB_WORKSPACE/docker-compose.log || true
- name: Upload logs
if: always()
uses: actions/upload-artifact@v4
with:
name: docker-all-logs
name: docker-all-logs-${{ matrix.test-dir.name }}
path: ${{ github.workspace }}/docker-compose.log
# ------------------------------------------------------------
@@ -206,4 +345,21 @@ jobs:
if: always()
run: |
cd deployment/docker_compose
docker compose -f docker-compose.dev.yml -p onyx-stack down -v
docker compose down -v
required:
runs-on: blacksmith-2vcpu-ubuntu-2404-arm
needs: [integration-tests-mit]
if: ${{ always() }}
steps:
- uses: actions/github-script@v7
with:
script: |
const needs = ${{ toJSON(needs) }};
const failed = Object.values(needs).some(n => n.result !== 'success');
if (failed) {
core.setFailed('One or more upstream jobs failed or were cancelled.');
} else {
core.notice('All required jobs succeeded.');
}

@@ -6,43 +6,171 @@ concurrency:
on: push
env:
# AWS ECR Configuration
AWS_REGION: ${{ secrets.AWS_REGION || 'us-west-2' }}
ECR_REGISTRY: ${{ secrets.ECR_REGISTRY }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_ECR }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_ECR }}
BUILDX_NO_DEFAULT_ATTESTATIONS: 1
# Test Environment Variables
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
GEN_AI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
EXA_API_KEY: ${{ secrets.EXA_API_KEY }}
# for federated slack tests
SLACK_CLIENT_ID: ${{ secrets.SLACK_CLIENT_ID }}
SLACK_CLIENT_SECRET: ${{ secrets.SLACK_CLIENT_SECRET }}
MOCK_LLM_RESPONSE: true
jobs:
playwright-tests:
name: Playwright Tests
build-web-image:
runs-on: blacksmith-8vcpu-ubuntu-2404-arm
steps:
- name: Checkout code
uses: actions/checkout@v4
# See https://runs-on.com/runners/linux/
runs-on:
[
runs-on,
runner=32cpu-linux-x64,
disk=large,
"run-id=${{ github.run_id }}",
]
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ env.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ env.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ env.AWS_REGION }}
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@v2
- name: Set up Docker Buildx
uses: useblacksmith/setup-docker-builder@v1
- name: Build and push Web Docker image
uses: useblacksmith/build-push-action@v2
with:
context: ./web
file: ./web/Dockerfile
platforms: linux/arm64
tags: ${{ env.ECR_REGISTRY }}/integration-test-onyx-web-server:playwright-test-${{ github.run_id }}
provenance: false
sbom: false
push: true
outputs: type=registry
no-cache: ${{ vars.DOCKER_NO_CACHE == 'true' }}
build-backend-image:
runs-on: blacksmith-8vcpu-ubuntu-2404-arm
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ env.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ env.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ env.AWS_REGION }}
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@v2
- name: Set up Docker Buildx
uses: useblacksmith/setup-docker-builder@v1
- name: Build and push Backend Docker image
uses: useblacksmith/build-push-action@v2
with:
context: ./backend
file: ./backend/Dockerfile
platforms: linux/arm64
tags: ${{ env.ECR_REGISTRY }}/integration-test-onyx-backend:playwright-test-${{ github.run_id }}
provenance: false
sbom: false
push: true
outputs: type=registry
no-cache: ${{ vars.DOCKER_NO_CACHE == 'true' }}
build-model-server-image:
runs-on: blacksmith-8vcpu-ubuntu-2404-arm
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ env.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ env.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ env.AWS_REGION }}
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@v2
- name: Set up Docker Buildx
uses: useblacksmith/setup-docker-builder@v1
- name: Build and push Model Server Docker image
uses: useblacksmith/build-push-action@v2
with:
context: ./backend
file: ./backend/Dockerfile.model_server
platforms: linux/arm64
tags: ${{ env.ECR_REGISTRY }}/integration-test-onyx-model-server:playwright-test-${{ github.run_id }}
provenance: false
sbom: false
push: true
outputs: type=registry
no-cache: ${{ vars.DOCKER_NO_CACHE == 'true' }}
playwright-tests:
needs: [build-web-image, build-backend-image, build-model-server-image]
name: Playwright Tests
runs-on: blacksmith-8vcpu-ubuntu-2404-arm
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Set up Python
uses: actions/setup-python@v5
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
python-version: "3.11"
cache: "pip"
cache-dependency-path: |
backend/requirements/default.txt
backend/requirements/dev.txt
backend/requirements/model_server.txt
- run: |
python -m pip install --upgrade pip
pip install --retries 5 --timeout 30 -r backend/requirements/default.txt
pip install --retries 5 --timeout 30 -r backend/requirements/dev.txt
pip install --retries 5 --timeout 30 -r backend/requirements/model_server.txt
aws-access-key-id: ${{ env.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ env.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ env.AWS_REGION }}
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@v2
# needed for pulling Vespa, Redis, Postgres, and Minio images
# otherwise, we hit the "Unauthenticated users" limit
# https://docs.docker.com/docker-hub/usage/
- name: Login to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Pull Docker images
run: |
# Pull all images from ECR in parallel
echo "Pulling Docker images in parallel..."
(docker pull ${{ env.ECR_REGISTRY }}/integration-test-onyx-web-server:playwright-test-${{ github.run_id }}) &
(docker pull ${{ env.ECR_REGISTRY }}/integration-test-onyx-backend:playwright-test-${{ github.run_id }}) &
(docker pull ${{ env.ECR_REGISTRY }}/integration-test-onyx-model-server:playwright-test-${{ github.run_id }}) &
# Wait for all background jobs to complete
wait
echo "All Docker images pulled successfully"
# Re-tag with expected names for docker-compose
docker tag ${{ env.ECR_REGISTRY }}/integration-test-onyx-web-server:playwright-test-${{ github.run_id }} onyxdotapp/onyx-web-server:test
docker tag ${{ env.ECR_REGISTRY }}/integration-test-onyx-backend:playwright-test-${{ github.run_id }} onyxdotapp/onyx-backend:test
docker tag ${{ env.ECR_REGISTRY }}/integration-test-onyx-model-server:playwright-test-${{ github.run_id }} onyxdotapp/onyx-model-server:test
- name: Setup node
uses: actions/setup-node@v4
@@ -57,79 +185,29 @@ jobs:
working-directory: ./web
run: npx playwright install --with-deps
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
# tag every docker image with "test" so that we can spin up the correct set
# of images during testing
# we use the runs-on cache for docker builds
# in conjunction with runs-on runners, it has better speed and unlimited caching
# https://runs-on.com/caching/s3-cache-for-github-actions/
# https://runs-on.com/caching/docker/
# https://github.com/moby/buildkit#s3-cache-experimental
# images are built and run locally for testing purposes. Not pushed.
- name: Build Web Docker image
uses: ./.github/actions/custom-build-and-push
with:
context: ./web
file: ./web/Dockerfile
platforms: linux/amd64
tags: onyxdotapp/onyx-web-server:test
push: false
load: true
cache-from: type=s3,prefix=cache/${{ github.repository }}/integration-tests/web-server/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }}
cache-to: type=s3,prefix=cache/${{ github.repository }}/integration-tests/web-server/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }},mode=max
- name: Build Backend Docker image
uses: ./.github/actions/custom-build-and-push
with:
context: ./backend
file: ./backend/Dockerfile
platforms: linux/amd64
tags: onyxdotapp/onyx-backend:test
push: false
load: true
cache-from: type=s3,prefix=cache/${{ github.repository }}/integration-tests/backend/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }}
cache-to: type=s3,prefix=cache/${{ github.repository }}/integration-tests/backend/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }},mode=max
- name: Build Model Server Docker image
uses: ./.github/actions/custom-build-and-push
with:
context: ./backend
file: ./backend/Dockerfile.model_server
platforms: linux/amd64
tags: onyxdotapp/onyx-model-server:test
push: false
load: true
cache-from: type=s3,prefix=cache/${{ github.repository }}/integration-tests/model-server/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }}
cache-to: type=s3,prefix=cache/${{ github.repository }}/integration-tests/model-server/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }},mode=max
- name: Create .env file for Docker Compose
run: |
cat <<EOF > deployment/docker_compose/.env
ENABLE_PAID_ENTERPRISE_EDITION_FEATURES=true
AUTH_TYPE=basic
GEN_AI_API_KEY=${{ env.OPENAI_API_KEY }}
EXA_API_KEY=${{ env.EXA_API_KEY }}
REQUIRE_EMAIL_VERIFICATION=false
DISABLE_TELEMETRY=true
IMAGE_TAG=test
EOF
- name: Start Docker containers
run: |
cd deployment/docker_compose
ENABLE_PAID_ENTERPRISE_EDITION_FEATURES=true \
AUTH_TYPE=basic \
GEN_AI_API_KEY=${{ secrets.OPENAI_API_KEY }} \
REQUIRE_EMAIL_VERIFICATION=false \
DISABLE_TELEMETRY=true \
IMAGE_TAG=test \
docker compose -f docker-compose.dev.yml -p danswer-stack up -d
docker compose -f docker-compose.yml -f docker-compose.dev.yml up -d
id: start_docker
- name: Wait for service to be ready
run: |
echo "Starting wait-for-service script..."
docker logs -f danswer-stack-api_server-1 &
docker logs -f onyx-api_server-1 &
start_time=$(date +%s)
timeout=300 # 5 minutes in seconds
@@ -159,22 +237,18 @@ jobs:
done
echo "Finished waiting for service."
- name: Run pytest playwright test init
working-directory: ./backend
env:
PYTEST_IGNORE_SKIP: true
run: pytest -s tests/integration/tests/playwright/test_playwright.py
- name: Run Playwright tests
working-directory: ./web
run: npx playwright test
run: |
# Create test-results directory to ensure it exists for artifact upload
mkdir -p test-results
npx playwright test
- uses: actions/upload-artifact@v4
if: always()
with:
# Chromatic automatically defaults to the test-results directory.
# Replace with the path to your custom directory and adjust the CHROMATIC_ARCHIVE_LOCATION environment variable accordingly.
name: test-results
# Includes test results and debug screenshots
name: playwright-test-results-${{ github.run_id }}
path: ./web/test-results
retention-days: 30
@@ -183,7 +257,7 @@ jobs:
if: success() || failure()
run: |
cd deployment/docker_compose
docker compose -f docker-compose.dev.yml -p danswer-stack logs > docker-compose.log
docker compose logs > docker-compose.log
mv docker-compose.log ${{ github.workspace }}/docker-compose.log
- name: Upload logs
@@ -196,7 +270,7 @@ jobs:
- name: Stop Docker containers
run: |
cd deployment/docker_compose
docker compose -f docker-compose.dev.yml -p danswer-stack down -v
docker compose down -v
# NOTE: Chromatic UI diff testing is currently disabled.
# We are using Playwright for local and CI testing without visual regression checks.

@@ -31,16 +31,31 @@ jobs:
pip install --retries 5 --timeout 30 -r backend/requirements/dev.txt
pip install --retries 5 --timeout 30 -r backend/requirements/model_server.txt
- name: Generate OpenAPI schema
working-directory: ./backend
env:
PYTHONPATH: "."
run: |
python scripts/onyx_openapi_schema.py --filename generated/openapi.json
- name: Generate OpenAPI Python client
working-directory: ./backend
run: |
docker run --rm \
-v "${{ github.workspace }}/backend/generated:/local" \
openapitools/openapi-generator-cli generate \
-i /local/openapi.json \
-g python \
-o /local/onyx_openapi_client \
--package-name onyx_openapi_client \
--skip-validate-spec \
--openapi-normalizer "SIMPLIFY_ONEOF_ANYOF=true,SET_OAS3_NULLABLE=true"
- name: Run MyPy
run: |
cd backend
mypy .
- name: Run ruff
run: |
cd backend
ruff .
- name: Check import order with reorder-python-imports
run: |
cd backend

@@ -12,18 +12,31 @@ env:
# AWS
AWS_ACCESS_KEY_ID_DAILY_CONNECTOR_TESTS: ${{ secrets.AWS_ACCESS_KEY_ID_DAILY_CONNECTOR_TESTS }}
AWS_SECRET_ACCESS_KEY_DAILY_CONNECTOR_TESTS: ${{ secrets.AWS_SECRET_ACCESS_KEY_DAILY_CONNECTOR_TESTS }}
# Cloudflare R2
R2_ACCOUNT_ID_DAILY_CONNECTOR_TESTS: ${{ vars.R2_ACCOUNT_ID_DAILY_CONNECTOR_TESTS }}
R2_ACCESS_KEY_ID_DAILY_CONNECTOR_TESTS: ${{ secrets.R2_ACCESS_KEY_ID_DAILY_CONNECTOR_TESTS }}
R2_SECRET_ACCESS_KEY_DAILY_CONNECTOR_TESTS: ${{ secrets.R2_SECRET_ACCESS_KEY_DAILY_CONNECTOR_TESTS }}
# Google Cloud Storage
GCS_ACCESS_KEY_ID_DAILY_CONNECTOR_TESTS: ${{ secrets.GCS_ACCESS_KEY_ID_DAILY_CONNECTOR_TESTS }}
GCS_SECRET_ACCESS_KEY_DAILY_CONNECTOR_TESTS: ${{ secrets.GCS_SECRET_ACCESS_KEY_DAILY_CONNECTOR_TESTS }}
# Confluence
CONFLUENCE_TEST_SPACE_URL: ${{ secrets.CONFLUENCE_TEST_SPACE_URL }}
CONFLUENCE_TEST_SPACE: ${{ secrets.CONFLUENCE_TEST_SPACE }}
CONFLUENCE_IS_CLOUD: ${{ secrets.CONFLUENCE_IS_CLOUD }}
CONFLUENCE_TEST_SPACE_URL: ${{ vars.CONFLUENCE_TEST_SPACE_URL }}
CONFLUENCE_TEST_SPACE: ${{ vars.CONFLUENCE_TEST_SPACE }}
CONFLUENCE_TEST_PAGE_ID: ${{ secrets.CONFLUENCE_TEST_PAGE_ID }}
CONFLUENCE_USER_NAME: ${{ secrets.CONFLUENCE_USER_NAME }}
CONFLUENCE_USER_NAME: ${{ vars.CONFLUENCE_USER_NAME }}
CONFLUENCE_ACCESS_TOKEN: ${{ secrets.CONFLUENCE_ACCESS_TOKEN }}
CONFLUENCE_ACCESS_TOKEN_SCOPED: ${{ secrets.CONFLUENCE_ACCESS_TOKEN_SCOPED }}
# Jira
JIRA_BASE_URL: ${{ secrets.JIRA_BASE_URL }}
JIRA_USER_EMAIL: ${{ secrets.JIRA_USER_EMAIL }}
JIRA_API_TOKEN: ${{ secrets.JIRA_API_TOKEN }}
JIRA_API_TOKEN_SCOPED: ${{ secrets.JIRA_API_TOKEN_SCOPED }}
# Gong
GONG_ACCESS_KEY: ${{ secrets.GONG_ACCESS_KEY }}
GONG_ACCESS_KEY_SECRET: ${{ secrets.GONG_ACCESS_KEY_SECRET }}
@@ -33,37 +46,76 @@ env:
GOOGLE_DRIVE_OAUTH_CREDENTIALS_JSON_STR: ${{ secrets.GOOGLE_DRIVE_OAUTH_CREDENTIALS_JSON_STR }}
GOOGLE_GMAIL_SERVICE_ACCOUNT_JSON_STR: ${{ secrets.GOOGLE_GMAIL_SERVICE_ACCOUNT_JSON_STR }}
GOOGLE_GMAIL_OAUTH_CREDENTIALS_JSON_STR: ${{ secrets.GOOGLE_GMAIL_OAUTH_CREDENTIALS_JSON_STR }}
# Slab
SLAB_BOT_TOKEN: ${{ secrets.SLAB_BOT_TOKEN }}
# Zendesk
ZENDESK_SUBDOMAIN: ${{ secrets.ZENDESK_SUBDOMAIN }}
ZENDESK_EMAIL: ${{ secrets.ZENDESK_EMAIL }}
ZENDESK_TOKEN: ${{ secrets.ZENDESK_TOKEN }}
# Salesforce
SF_USERNAME: ${{ secrets.SF_USERNAME }}
SF_PASSWORD: ${{ secrets.SF_PASSWORD }}
SF_SECURITY_TOKEN: ${{ secrets.SF_SECURITY_TOKEN }}
# Hubspot
HUBSPOT_ACCESS_TOKEN: ${{ secrets.HUBSPOT_ACCESS_TOKEN }}
# IMAP
IMAP_HOST: ${{ vars.IMAP_HOST }}
IMAP_USERNAME: ${{ vars.IMAP_USERNAME }}
IMAP_PASSWORD: ${{ secrets.IMAP_PASSWORD }}
IMAP_MAILBOXES: ${{ vars.IMAP_MAILBOXES }}
# Airtable
AIRTABLE_TEST_BASE_ID: ${{ secrets.AIRTABLE_TEST_BASE_ID }}
AIRTABLE_TEST_TABLE_ID: ${{ secrets.AIRTABLE_TEST_TABLE_ID }}
AIRTABLE_TEST_TABLE_NAME: ${{ secrets.AIRTABLE_TEST_TABLE_NAME }}
AIRTABLE_TEST_BASE_ID: ${{ vars.AIRTABLE_TEST_BASE_ID }}
AIRTABLE_TEST_TABLE_ID: ${{ vars.AIRTABLE_TEST_TABLE_ID }}
AIRTABLE_TEST_TABLE_NAME: ${{ vars.AIRTABLE_TEST_TABLE_NAME }}
AIRTABLE_ACCESS_TOKEN: ${{ secrets.AIRTABLE_ACCESS_TOKEN }}
# Sharepoint
SHAREPOINT_CLIENT_ID: ${{ secrets.SHAREPOINT_CLIENT_ID }}
SHAREPOINT_CLIENT_ID: ${{ vars.SHAREPOINT_CLIENT_ID }}
SHAREPOINT_CLIENT_SECRET: ${{ secrets.SHAREPOINT_CLIENT_SECRET }}
SHAREPOINT_CLIENT_DIRECTORY_ID: ${{ secrets.SHAREPOINT_CLIENT_DIRECTORY_ID }}
SHAREPOINT_SITE: ${{ secrets.SHAREPOINT_SITE }}
SHAREPOINT_CLIENT_DIRECTORY_ID: ${{ vars.SHAREPOINT_CLIENT_DIRECTORY_ID }}
SHAREPOINT_SITE: ${{ vars.SHAREPOINT_SITE }}
# Github
ACCESS_TOKEN_GITHUB: ${{ secrets.ACCESS_TOKEN_GITHUB }}
# Gitlab
GITLAB_ACCESS_TOKEN: ${{ secrets.GITLAB_ACCESS_TOKEN }}
# Gitbook
GITBOOK_SPACE_ID: ${{ secrets.GITBOOK_SPACE_ID }}
GITBOOK_API_KEY: ${{ secrets.GITBOOK_API_KEY }}
# Notion
NOTION_INTEGRATION_TOKEN: ${{ secrets.NOTION_INTEGRATION_TOKEN }}
# Highspot
HIGHSPOT_KEY: ${{ secrets.HIGHSPOT_KEY }}
HIGHSPOT_SECRET: ${{ secrets.HIGHSPOT_SECRET }}
# Slack
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
# Teams
TEAMS_APPLICATION_ID: ${{ secrets.TEAMS_APPLICATION_ID }}
TEAMS_DIRECTORY_ID: ${{ secrets.TEAMS_DIRECTORY_ID }}
TEAMS_SECRET: ${{ secrets.TEAMS_SECRET }}
# Bitbucket
BITBUCKET_WORKSPACE: ${{ secrets.BITBUCKET_WORKSPACE }}
BITBUCKET_REPOSITORIES: ${{ secrets.BITBUCKET_REPOSITORIES }}
BITBUCKET_PROJECTS: ${{ secrets.BITBUCKET_PROJECTS }}
BITBUCKET_EMAIL: ${{ vars.BITBUCKET_EMAIL }}
BITBUCKET_API_TOKEN: ${{ secrets.BITBUCKET_API_TOKEN }}
# Fireflies
FIREFLIES_API_KEY: ${{ secrets.FIREFLIES_API_KEY }}
jobs:
connectors-check:
# See https://runs-on.com/runners/linux/
@@ -93,9 +145,76 @@ jobs:
playwright install chromium
playwright install-deps chromium
- name: Run Tests
- name: Detect Connector changes
id: changes
uses: dorny/paths-filter@v3
with:
filters: |
hubspot:
- 'backend/onyx/connectors/hubspot/**'
- 'backend/tests/daily/connectors/hubspot/**'
salesforce:
- 'backend/onyx/connectors/salesforce/**'
- 'backend/tests/daily/connectors/salesforce/**'
github:
- 'backend/onyx/connectors/github/**'
- 'backend/tests/daily/connectors/github/**'
file_processing:
- 'backend/onyx/file_processing/**'
- name: Run Tests (excluding HubSpot, Salesforce, and GitHub)
shell: script -q -e -c "bash --noprofile --norc -eo pipefail {0}"
run: py.test -o junit_family=xunit2 -xv --ff backend/tests/daily/connectors
run: |
py.test \
-n 8 \
--dist loadfile \
--durations=8 \
-o junit_family=xunit2 \
-xv \
--ff \
backend/tests/daily/connectors \
--ignore backend/tests/daily/connectors/hubspot \
--ignore backend/tests/daily/connectors/salesforce \
--ignore backend/tests/daily/connectors/github
- name: Run HubSpot Connector Tests
if: ${{ github.event_name == 'schedule' || steps.changes.outputs.hubspot == 'true' || steps.changes.outputs.file_processing == 'true' }}
shell: script -q -e -c "bash --noprofile --norc -eo pipefail {0}"
run: |
py.test \
-n 8 \
--dist loadfile \
--durations=8 \
-o junit_family=xunit2 \
-xv \
--ff \
backend/tests/daily/connectors/hubspot
- name: Run Salesforce Connector Tests
if: ${{ github.event_name == 'schedule' || steps.changes.outputs.salesforce == 'true' || steps.changes.outputs.file_processing == 'true' }}
shell: script -q -e -c "bash --noprofile --norc -eo pipefail {0}"
run: |
py.test \
-n 8 \
--dist loadfile \
--durations=8 \
-o junit_family=xunit2 \
-xv \
--ff \
backend/tests/daily/connectors/salesforce
- name: Run GitHub Connector Tests
if: ${{ github.event_name == 'schedule' || steps.changes.outputs.github == 'true' || steps.changes.outputs.file_processing == 'true' }}
shell: script -q -e -c "bash --noprofile --norc -eo pipefail {0}"
run: |
py.test \
-n 8 \
--dist loadfile \
--durations=8 \
-o junit_family=xunit2 \
-xv \
--ff \
backend/tests/daily/connectors/github
- name: Alert on Failure
if: failure() && github.event_name == 'schedule'

@@ -15,7 +15,7 @@ env:
# Bedrock
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_REGION_NAME: ${{ secrets.AWS_REGION_NAME }}
AWS_REGION_NAME: ${{ vars.AWS_REGION_NAME }}
# API keys for testing
COHERE_API_KEY: ${{ secrets.COHERE_API_KEY }}
@@ -23,7 +23,7 @@ env:
LITELLM_API_URL: ${{ secrets.LITELLM_API_URL }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
AZURE_API_KEY: ${{ secrets.AZURE_API_KEY }}
AZURE_API_URL: ${{ secrets.AZURE_API_URL }}
AZURE_API_URL: ${{ vars.AZURE_API_URL }}
jobs:
model-check:
@@ -77,7 +77,7 @@ jobs:
REQUIRE_EMAIL_VERIFICATION=false \
DISABLE_TELEMETRY=true \
IMAGE_TAG=test \
docker compose -f docker-compose.model-server-test.yml -p onyx-stack up -d indexing_model_server
docker compose -f docker-compose.model-server-test.yml up -d indexing_model_server
id: start_docker
- name: Wait for service to be ready
@@ -132,7 +132,7 @@ jobs:
if: always()
run: |
cd deployment/docker_compose
docker compose -f docker-compose.model-server-test.yml -p onyx-stack logs --no-color > $GITHUB_WORKSPACE/docker-compose.log || true
docker compose -f docker-compose.model-server-test.yml logs --no-color > $GITHUB_WORKSPACE/docker-compose.log || true
- name: Upload logs
if: always()
@@ -145,5 +145,5 @@ jobs:
if: always()
run: |
cd deployment/docker_compose
docker compose -f docker-compose.model-server-test.yml -p onyx-stack down -v
docker compose -f docker-compose.model-server-test.yml down -v

@@ -15,6 +15,9 @@ jobs:
env:
PYTHONPATH: ./backend
REDIS_CLOUD_PYTEST_PASSWORD: ${{ secrets.REDIS_CLOUD_PYTEST_PASSWORD }}
SF_USERNAME: ${{ secrets.SF_USERNAME }}
SF_PASSWORD: ${{ secrets.SF_PASSWORD }}
SF_SECURITY_TOKEN: ${{ secrets.SF_SECURITY_TOKEN }}
steps:
- name: Checkout code
@@ -28,12 +31,14 @@ jobs:
cache-dependency-path: |
backend/requirements/default.txt
backend/requirements/dev.txt
backend/requirements/model_server.txt
- name: Install Dependencies
run: |
python -m pip install --upgrade pip
pip install --retries 5 --timeout 30 -r backend/requirements/default.txt
pip install --retries 5 --timeout 30 -r backend/requirements/dev.txt
pip install --retries 5 --timeout 30 -r backend/requirements/model_server.txt
- name: Run Tests
shell: script -q -e -c "bash --noprofile --norc -eo pipefail {0}"

@@ -15,6 +15,9 @@ jobs:
# actions using GITHUB_TOKEN cannot trigger another workflow, but we do want this to trigger docker pushes
# see https://github.com/orgs/community/discussions/27028#discussioncomment-3254367 for the workaround we
# implement here which needs an actual user's deploy key
# Additional NOTE: even though this is named "rkuo", the actual key is tied to the onyx repo
# and not rkuo's personal account. It is fine to leave this key as is!
- name: Checkout code
uses: actions/checkout@v4
with:

.gitignore
@@ -14,12 +14,29 @@
/web/test-results/
backend/onyx/agent_search/main/test_data.json
backend/tests/regression/answer_quality/test_data.json
backend/tests/regression/search_quality/eval-*
backend/tests/regression/search_quality/search_eval_config.yaml
backend/tests/regression/search_quality/*.json
backend/onyx/evals/data/
*.log
# secret files
.env
jira_test_env
settings.json
# others
/deployment/data/nginx/app.conf
*.sw?
/backend/tests/regression/answer_quality/search_test_config.yaml
*.egg-info
# Local .terraform directories
**/.terraform/*
# Local .tfstate files
*.tfstate
*.tfstate.*
# Local .terraform.lock.hcl file
.terraform.lock.hcl

.mcp.json.template
@@ -0,0 +1,8 @@
{
"mcpServers": {
"onyx-mcp": {
"type": "http",
"url": "http://localhost:8000/mcp"
}
}
}

@@ -34,8 +34,25 @@ repos:
hooks:
- id: prettier
types_or: [html, css, javascript, ts, tsx]
additional_dependencies:
- prettier
language_version: system
- repo: https://github.com/sirwart/ripsecrets
rev: v0.1.11
hooks:
- id: ripsecrets
args:
- --additional-pattern
- ^sk-[A-Za-z0-9_\-]{20,}$
- repo: local
hooks:
- id: check-lazy-imports
name: Check lazy imports are not directly imported
entry: python3 backend/scripts/check_lazy_imports.py
language: system
files: ^backend/.*\.py$
pass_filenames: false
# We would like to have a mypy pre-commit hook, but because
# pre-commit runs in its own isolated environment, we would need to install

@@ -10,7 +10,7 @@ SKIP_WARM_UP=True
# Always keep these on for Dev
# Logs all model prompts to stdout
LOG_DANSWER_MODEL_INTERACTIONS=True
LOG_ONYX_MODEL_INTERACTIONS=True
# More verbose logging
LOG_LEVEL=debug
@@ -23,6 +23,9 @@ DISABLE_LLM_DOC_RELEVANCE=False
# Useful if you want to toggle auth on/off (google_oauth/OIDC specifically)
OAUTH_CLIENT_ID=<REPLACE THIS>
OAUTH_CLIENT_SECRET=<REPLACE THIS>
OPENID_CONFIG_URL=<REPLACE THIS>
SAML_CONF_DIR=/<ABSOLUTE PATH TO ONYX>/onyx/backend/ee/onyx/configs/saml_config
# Generally not useful for dev, we don't generally want to set up an SMTP server for dev
REQUIRE_EMAIL_VERIFICATION=False
@@ -36,8 +39,8 @@ FAST_GEN_AI_MODEL_VERSION=gpt-4o
# For Danswer Slack Bot, overrides the UI values so no need to set this up via UI every time
# Only needed if using DanswerBot
#DANSWER_BOT_SLACK_APP_TOKEN=<REPLACE THIS>
#DANSWER_BOT_SLACK_BOT_TOKEN=<REPLACE THIS>
#ONYX_BOT_SLACK_APP_TOKEN=<REPLACE THIS>
#ONYX_BOT_SLACK_BOT_TOKEN=<REPLACE THIS>
# Python stuff
@@ -45,8 +48,8 @@ PYTHONPATH=../backend
PYTHONUNBUFFERED=1
# Internet Search
BING_API_KEY=<REPLACE THIS>
# Internet Search
EXA_API_KEY=<REPLACE THIS>
# Enable the full set of Danswer Enterprise Edition features
@@ -58,3 +61,18 @@ AGENT_RETRIEVAL_STATS=False # Note: This setting will incur substantial re-ran
AGENT_RERANKING_STATS=True
AGENT_MAX_QUERY_RETRIEVAL_RESULTS=20
AGENT_RERANKING_MAX_QUERY_RETRIEVAL_RESULTS=20
# S3 File Store Configuration (MinIO for local development)
S3_ENDPOINT_URL=http://localhost:9004
S3_FILE_STORE_BUCKET_NAME=onyx-file-store-bucket
S3_AWS_ACCESS_KEY_ID=minioadmin
S3_AWS_SECRET_ACCESS_KEY=minioadmin
# Show extra/uncommon connectors
SHOW_EXTRA_CONNECTORS=True
# Local langsmith tracing
LANGSMITH_TRACING="true"
LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
LANGSMITH_API_KEY=<REPLACE_THIS>
LANGSMITH_PROJECT=<REPLACE_THIS>

(one large file diff suppressed)
.vscode/tasks.template.jsonc
@@ -0,0 +1,101 @@
{
"version": "2.0.0",
"tasks": [
{
"type": "austin",
"label": "Profile celery beat",
"envFile": "${workspaceFolder}/.env",
"options": {
"cwd": "${workspaceFolder}/backend"
},
"command": [
"sudo",
"-E"
],
"args": [
"celery",
"-A",
"onyx.background.celery.versioned_apps.beat",
"beat",
"--loglevel=INFO"
]
},
{
"type": "shell",
"label": "Generate Onyx OpenAPI Python client",
"cwd": "${workspaceFolder}/backend",
"envFile": "${workspaceFolder}/.env",
"options": {
"cwd": "${workspaceFolder}/backend"
},
"command": [
"openapi-generator"
],
"args": [
"generate",
"-i",
"generated/openapi.json",
"-g",
"python",
"-o",
"generated/onyx_openapi_client",
"--package-name",
"onyx_openapi_client",
]
},
{
"type": "shell",
"label": "Generate Typescript Fetch client (openapi-generator)",
"envFile": "${workspaceFolder}/.env",
"options": {
"cwd": "${workspaceFolder}"
},
"command": [
"openapi-generator"
],
"args": [
"generate",
"-i",
"backend/generated/openapi.json",
"-g",
"typescript-fetch",
"-o",
"${workspaceFolder}/web/src/lib/generated/onyx_api",
"--additional-properties=disallowAdditionalPropertiesIfNotPresent=false,legacyDiscriminatorBehavior=false,supportsES6=true",
]
},
{
"type": "shell",
"label": "Generate TypeScript Client (openapi-ts)",
"envFile": "${workspaceFolder}/.env",
"options": {
"cwd": "${workspaceFolder}/web"
},
"command": [
"npx"
],
"args": [
"openapi-typescript",
"../backend/generated/openapi.json",
"--output",
"./src/lib/generated/onyx-schema.ts",
]
},
{
"type": "shell",
"label": "Generate TypeScript Client (orval)",
"envFile": "${workspaceFolder}/.env",
"options": {
"cwd": "${workspaceFolder}/web"
},
"command": [
"npx"
],
"args": [
"orval",
"--config",
"orval.config.js",
]
}
]
}

AGENTS.md
@@ -0,0 +1,325 @@
# AGENTS.md
This file provides guidance to Codex when working with code in this repository.
## KEY NOTES
- If you run into any missing Python dependency errors, try running your command after `source backend/.venv/bin/activate` to activate the Python venv.
- To make tests work, check the `.env` file at the root of the project to find an OpenAI key.
- If using `playwright` to explore the frontend, you can usually log in with username `a@test.com` and password
`a`. The app can be accessed at `http://localhost:3000`.
- You should assume that all Onyx services are running. To verify, check the `backend/log` directory and confirm that logs are being emitted by the relevant service.
- To connect to the Postgres database, use: `docker exec -it onyx-relational_db-1 psql -U postgres -c "<SQL>"`
- When making calls to the backend, always go through the frontend. E.g. make a call to `http://localhost:3000/api/persona` not `http://localhost:8080/api/persona`
- Put ALL db operations under the `backend/onyx/db` / `backend/ee/onyx/db` directories. Don't run queries
outside of those directories (a minimal layout sketch follows this list).
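As a loose illustration of that last rule, a query helper under `backend/onyx/db/` might look like the sketch below. The `Persona` model and its fields are illustrative stand-ins, not the real Onyx schema; the point is only that the query itself lives in the `db` module and callers import the helper.
```python
# Hypothetical sketch: backend/onyx/db/persona.py
# "Persona" here is a stand-in model, not the actual Onyx table definition.
from sqlalchemy import Integer, String, select
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class Persona(Base):  # illustrative model for the sketch
    __tablename__ = "persona"
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    name: Mapped[str] = mapped_column(String)


def get_persona_name(db_session: Session, persona_id: int) -> str | None:
    # The raw query stays here; endpoint and task code call this helper
    # instead of issuing SQL outside the db directories.
    stmt = select(Persona.name).where(Persona.id == persona_id)
    return db_session.execute(stmt).scalar_one_or_none()
```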
## Project Overview
**Onyx** (formerly Danswer) is an open-source Gen-AI and Enterprise Search platform that connects to company documents, apps, and people. It features a modular architecture with both Community Edition (MIT licensed) and Enterprise Edition offerings.
### Background Workers (Celery)
Onyx uses Celery for asynchronous task processing with multiple specialized workers:
#### Worker Types
1. **Primary Worker** (`celery_app.py`)
- Coordinates core background tasks and system-wide operations
- Handles connector management, document sync, pruning, and periodic checks
- Runs with 4 threads concurrency
- Tasks: connector deletion, vespa sync, pruning, LLM model updates, user file sync
2. **Docfetching Worker** (`docfetching`)
- Fetches documents from external data sources (connectors)
- Spawns docprocessing tasks for each document batch
- Implements watchdog monitoring for stuck connectors
- Configurable concurrency (default from env)
3. **Docprocessing Worker** (`docprocessing`)
- Processes fetched documents through the indexing pipeline:
- Upserts documents to PostgreSQL
- Chunks documents and adds contextual information
- Embeds chunks via model server
- Writes chunks to Vespa vector database
- Updates document metadata
- Configurable concurrency (default from env)
4. **Light Worker** (`light`)
- Handles lightweight, fast operations
- Tasks: vespa operations, document permissions sync, external group sync
- Higher concurrency for quick tasks
5. **Heavy Worker** (`heavy`)
- Handles resource-intensive operations
- Primary task: document pruning operations
- Runs with 4 threads concurrency
6. **KG Processing Worker** (`kg_processing`)
- Handles Knowledge Graph processing and clustering
- Builds relationships between documents
- Runs clustering algorithms
- Configurable concurrency
7. **Monitoring Worker** (`monitoring`)
- System health monitoring and metrics collection
- Monitors Celery queues, process memory, and system status
- Single thread (monitoring doesn't need parallelism)
- Cloud-specific monitoring tasks
8. **User File Processing Worker** (`user_file_processing`)
- Processes user-uploaded files
- Handles user file indexing and project synchronization
- Configurable concurrency
9. **Beat Worker** (`beat`)
- Celery's scheduler for periodic tasks
- Uses DynamicTenantScheduler for multi-tenant support
- Schedules tasks like the following (a plain-Celery sketch of the schedule shape follows this list):
- Indexing checks (every 15 seconds)
- Connector deletion checks (every 20 seconds)
- Vespa sync checks (every 20 seconds)
- Pruning checks (every 20 seconds)
- KG processing (every 60 seconds)
- Monitoring tasks (every 5 minutes)
- Cleanup tasks (hourly)
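For orientation, here is what a schedule of that shape looks like in plain Celery. The task names and the `Celery(...)` setup below are hypothetical; Onyx's real schedule is assembled by `DynamicTenantScheduler`, so treat this as an illustration of the structure only.
```python
# Sketch only: plain-Celery beat schedule with hypothetical task names.
from celery import Celery

app = Celery("onyx_sketch", broker="redis://localhost:6379/0")

app.conf.beat_schedule = {
    "check-for-indexing": {
        "task": "check_for_indexing",  # hypothetical name
        "schedule": 15.0,  # seconds ("every 15 seconds" above)
    },
    "check-for-connector-deletion": {
        "task": "check_for_connector_deletion",  # hypothetical name
        "schedule": 20.0,
    },
    "monitoring": {
        "task": "monitor_background_processes",  # hypothetical name
        "schedule": 300.0,  # every 5 minutes
    },
}
```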
#### Worker Deployment Modes
Onyx supports two deployment modes for background workers, controlled by the `USE_LIGHTWEIGHT_BACKGROUND_WORKER` environment variable (a minimal branching sketch follows the lists below):
**Lightweight Mode** (default, `USE_LIGHTWEIGHT_BACKGROUND_WORKER=true`):
- Runs a single consolidated `background` worker that handles all background tasks:
- Pruning operations (from `heavy` worker)
- Knowledge graph processing (from `kg_processing` worker)
- Monitoring tasks (from `monitoring` worker)
- User file processing (from `user_file_processing` worker)
- Lower resource footprint (single worker process)
- Suitable for smaller deployments or development environments
- Default concurrency: 6 threads
**Standard Mode** (`USE_LIGHTWEIGHT_BACKGROUND_WORKER=false`):
- Runs separate specialized workers as documented above (heavy, kg_processing, monitoring, user_file_processing)
- Better isolation and scalability
- Can scale individual workers independently based on workload
- Suitable for production deployments with higher load
The deployment mode affects:
- **Backend**: Worker processes spawned by supervisord or dev scripts
- **Helm**: Which Kubernetes deployments are created
- **Dev Environment**: Which workers `dev_run_background_jobs.py` spawns
#### Key Features
- **Thread-based Workers**: All workers use thread pools (not processes) for stability
- **Tenant Awareness**: Multi-tenant support with per-tenant task isolation. There is a
middleware layer that automatically finds the appropriate tenant ID when sending tasks
via Celery Beat.
- **Task Prioritization**: High, Medium, Low priority queues
- **Monitoring**: Built-in heartbeat and liveness checking
- **Failure Handling**: Automatic retry and failure recovery mechanisms
- **Redis Coordination**: Inter-process communication via Redis
- **PostgreSQL State**: Task state and metadata stored in PostgreSQL
#### Important Notes
**Defining Tasks**:
- Always use `@shared_task` rather than `@celery_app`
- Put tasks under `background/celery/tasks/` or `ee/background/celery/tasks`
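A minimal sketch of this convention (the module path and task name are illustrative, not actual Onyx tasks):
```python
# e.g. backend/onyx/background/celery/tasks/example/tasks.py (hypothetical path)
from celery import Task, shared_task


@shared_task(bind=True, ignore_result=True)  # use @shared_task, never @celery_app.task
def example_sync_task(self: Task, tenant_id: str) -> None:
    """Illustrative body only; real tasks live under background/celery/tasks/."""
    # ... perform the background work for the given tenant ...
```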
**Defining APIs**:
When creating new FastAPI APIs, do NOT use the `response_model` field. Instead, just annotate the
function's return type.
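A hedged sketch of this pattern (the route and model are made up for illustration):
```python
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()


class WidgetSnapshot(BaseModel):  # hypothetical response model
    id: int
    name: str


# No response_model=... kwarg -- FastAPI infers the response schema from the return annotation.
@router.get("/widgets/{widget_id}")
def get_widget(widget_id: int) -> WidgetSnapshot:
    return WidgetSnapshot(id=widget_id, name="example")
```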
**Testing Updates**:
If you make any updates to a celery worker and you want to test these changes, you will need
to ask me to restart the celery worker. There is no auto-restart on code-change mechanism.
### Code Quality
```bash
# Install and run pre-commit hooks
pre-commit install
pre-commit run --all-files
```
NOTE: Always make sure everything is strictly typed (both in Python and TypeScript).
## Architecture Overview
### Technology Stack
- **Backend**: Python 3.11, FastAPI, SQLAlchemy, Alembic, Celery
- **Frontend**: Next.js 15+, React 18, TypeScript, Tailwind CSS
- **Database**: PostgreSQL with Redis caching
- **Search**: Vespa vector database
- **Auth**: OAuth2, SAML, multi-provider support
- **AI/ML**: LangChain, LiteLLM, multiple embedding models
### Directory Structure
```
backend/
├── onyx/
│ ├── auth/ # Authentication & authorization
│ ├── chat/ # Chat functionality & LLM interactions
│ ├── connectors/ # Data source connectors
│ ├── db/ # Database models & operations
│ ├── document_index/ # Vespa integration
│ ├── federated_connectors/ # External search connectors
│ ├── llm/ # LLM provider integrations
│ └── server/ # API endpoints & routers
├── ee/ # Enterprise Edition features
├── alembic/ # Database migrations
└── tests/ # Test suites
web/
├── src/app/ # Next.js app router pages
├── src/components/ # Reusable React components
└── src/lib/ # Utilities & business logic
```
## Database & Migrations
### Running Migrations
```bash
# Standard migrations
alembic upgrade head
# Multi-tenant (Enterprise)
alembic -n schema_private upgrade head
```
### Creating Migrations
```bash
# Auto-generate migration
alembic revision --autogenerate -m "description"
# Multi-tenant migration
alembic -n schema_private revision --autogenerate -m "description"
```
## Testing Strategy
There are 4 main types of tests within Onyx:
### Unit Tests
These should not assume any Onyx/external services are available to be called.
Interactions with the outside world should be mocked using `unittest.mock`. Generally, only
write these for complex, isolated modules e.g. `citation_processing.py`.
To run them:
```bash
python -m dotenv -f .vscode/.env run -- pytest -xv backend/tests/unit
```
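For illustration, a self-contained example in this style (the helper is a made-up stand-in, not real Onyx code):
```python
# Hypothetical unit test -- everything that would touch the outside world is injected/mocked.
from collections.abc import Callable
from unittest.mock import MagicMock


def build_citation(doc_id: str, fetch_title: Callable[[str], str]) -> str:
    # In real code the title lookup might hit a service; injecting it keeps the test pure.
    return f"[{fetch_title(doc_id)}]"


def test_build_citation_uses_stubbed_fetcher() -> None:
    fake_fetch = MagicMock(return_value="Doc One")
    assert build_citation("doc-1", fake_fetch) == "[Doc One]"
    fake_fetch.assert_called_once_with("doc-1")
```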
### External Dependency Unit Tests
These tests assume that all external dependencies of Onyx are available and callable (e.g. Postgres, Redis,
MinIO/S3, and Vespa are running; OpenAI can be called; requests to the internet are fine; etc.).
However, the actual Onyx containers are not running; in these tests we call the function under test directly.
We can also mock components/calls at will.
The goal with these tests is to minimize mocking while giving some flexibility to mock things that are flaky,
need strictly controlled behavior, or need to have their internal behavior validated (e.g. verifying a function is called
with certain args, something that would be impossible with proper integration tests).
A great example of this type of test is `backend/tests/external_dependency_unit/connectors/confluence/test_confluence_group_sync.py`.
To run them:
```bash
python -m dotenv -f .vscode/.env run -- pytest backend/tests/external_dependency_unit
```
### Integration Tests
Standard integration tests. Every test in `backend/tests/integration` runs against a real Onyx deployment. We cannot
mock anything in these tests. Prefer writing integration tests (or External Dependency Unit Tests if mocking/internal
verification is necessary) over any other type of test.
Tests are parallelized at a directory level.
When writing integration tests, make sure to check the root `conftest.py` for useful fixtures + the `backend/tests/integration/common_utils` directory for utilities. Prefer calling the appropriate Manager
class in the utils (if one exists) over directly calling the APIs with a library like `requests`. Prefer using fixtures rather than
calling the utilities directly (e.g. do NOT create admin users with
`admin_user = UserManager.create(name="admin_user")`; instead use the `admin_user` fixture).
A great example of this type of test is `backend/tests/integration/dev_apis/test_simple_chat_api.py`.
To run them:
```bash
python -m dotenv -f .vscode/.env run -- pytest backend/tests/integration
```
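A rough skeleton of the fixture-first style described above (the assertion is a placeholder; a real test would call a Manager class from `common_utils`):
```python
# Hypothetical integration test -- runs against the live Onyx deployment.
def test_admin_user_is_provisioned(admin_user) -> None:
    # The `admin_user` fixture (from the root conftest.py) already created the user;
    # do NOT call UserManager.create(...) directly. A real test would now exercise an
    # endpoint via the appropriate Manager class and assert on the returned data.
    assert admin_user is not None
```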
### Playwright (E2E) Tests
These tests are an even more complete version of the Integration Tests mentioned above. They run with all Onyx
services, *including* the Web Server.
Use these tests for anything that requires significant frontend <-> backend coordination.
Tests are located at `web/tests/e2e`. Tests are written in TypeScript.
To run them:
```bash
npx playwright test <TEST_NAME>
```
## Logs
When (1) writing integration tests or (2) doing live tests (e.g. curl / playwright), you can access
logs via the `backend/log/<service_name>_debug.log` files. Each Onyx service (api_server, web_server, celery_X)
continuously writes its logs to its own file there.
## Security Considerations
- Never commit API keys or secrets to repository
- Use encrypted credential storage for connector credentials
- Follow RBAC patterns for new features
- Implement proper input validation with Pydantic models
- Use parameterized queries to prevent SQL injection
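For example, a minimal parameterized-query sketch with SQLAlchemy (the table/column names are illustrative; real queries belong under `backend/onyx/db`):
```python
from sqlalchemy import text
from sqlalchemy.orm import Session


def fetch_document_ids_for_owner(db_session: Session, owner_email: str) -> list[str]:
    # The bound parameter (:email) keeps user input out of the SQL string itself.
    stmt = text("SELECT id FROM document WHERE owner_email = :email")
    return [row[0] for row in db_session.execute(stmt, {"email": owner_email})]
```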
## AI/LLM Integration
- Multiple LLM providers supported via LiteLLM
- Configurable models per feature (chat, search, embeddings)
- Streaming support for real-time responses
- Token management and rate limiting
- Custom prompts and agent actions
## UI/UX Patterns
- Tailwind CSS with design system in `web/src/components/ui/`
- Radix UI and Headless UI for accessible components
- SWR for data fetching and caching
- Form validation with react-hook-form
- Error handling with popup notifications
## Creating a Plan
When creating a plan in the `plans` directory, make sure to include at least these elements:
**Issues to Address**
What the change is meant to do.
**Important Notes**
Things you come across in your research that are important to the implementation.
**Implementation strategy**
How you are going to make the changes happen. High level approach.
**Tests**
What unit (use rarely), external dependency unit, integration, and playwright tests you plan to write to
verify the correct behavior. Don't overtest. Usually, a given change only needs one type of test.
Do NOT include these: *Timeline*, *Rollback plan*
This is a minimal list - feel free to include more. Do NOT write code as part of your plan.
Keep it high level. You can reference certain files or functions though.
Before writing your plan, make sure to do research. Explore the relevant sections in the codebase.

328
CLAUDE.md Normal file
View File

@@ -0,0 +1,328 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## KEY NOTES
- If you run into any missing python dependency errors, try running your command after `source backend/.venv/bin/activate`
  to activate the Python venv.
- To make tests work, check the `.env` file at the root of the project to find an OpenAI key.
- If using `playwright` to explore the frontend, you can usually log in with username `a@test.com` and password
`a`. The app can be accessed at `http://localhost:3000`.
- You should assume that all Onyx services are running. To verify, you can check the `backend/log` directory and
  make sure logs are being written by the relevant service.
- To connect to the Postgres database, use: `docker exec -it onyx-relational_db-1 psql -U postgres -c "<SQL>"`
- When making calls to the backend, always go through the frontend. E.g. make a call to `http://localhost:3000/api/persona` not `http://localhost:8080/api/persona`
- Put ALL db operations under the `backend/onyx/db` / `backend/ee/onyx/db` directories. Don't run queries
outside of those directories.
## Project Overview
**Onyx** (formerly Danswer) is an open-source Gen-AI and Enterprise Search platform that connects to company documents, apps, and people. It features a modular architecture with both Community Edition (MIT licensed) and Enterprise Edition offerings.
### Background Workers (Celery)
Onyx uses Celery for asynchronous task processing with multiple specialized workers:
#### Worker Types
1. **Primary Worker** (`celery_app.py`)
- Coordinates core background tasks and system-wide operations
- Handles connector management, document sync, pruning, and periodic checks
- Runs with 4 threads concurrency
- Tasks: connector deletion, vespa sync, pruning, LLM model updates, user file sync
2. **Docfetching Worker** (`docfetching`)
- Fetches documents from external data sources (connectors)
- Spawns docprocessing tasks for each document batch
- Implements watchdog monitoring for stuck connectors
- Configurable concurrency (default from env)
3. **Docprocessing Worker** (`docprocessing`)
- Processes fetched documents through the indexing pipeline:
- Upserts documents to PostgreSQL
- Chunks documents and adds contextual information
- Embeds chunks via model server
- Writes chunks to Vespa vector database
- Updates document metadata
- Configurable concurrency (default from env)
4. **Light Worker** (`light`)
- Handles lightweight, fast operations
- Tasks: vespa operations, document permissions sync, external group sync
- Higher concurrency for quick tasks
5. **Heavy Worker** (`heavy`)
- Handles resource-intensive operations
- Primary task: document pruning operations
- Runs with 4 threads concurrency
6. **KG Processing Worker** (`kg_processing`)
- Handles Knowledge Graph processing and clustering
- Builds relationships between documents
- Runs clustering algorithms
- Configurable concurrency
7. **Monitoring Worker** (`monitoring`)
- System health monitoring and metrics collection
- Monitors Celery queues, process memory, and system status
- Single thread (monitoring doesn't need parallelism)
- Cloud-specific monitoring tasks
8. **User File Processing Worker** (`user_file_processing`)
- Processes user-uploaded files
- Handles user file indexing and project synchronization
- Configurable concurrency
9. **Beat Worker** (`beat`)
- Celery's scheduler for periodic tasks
- Uses DynamicTenantScheduler for multi-tenant support
- Schedules tasks like:
- Indexing checks (every 15 seconds)
- Connector deletion checks (every 20 seconds)
- Vespa sync checks (every 20 seconds)
- Pruning checks (every 20 seconds)
- KG processing (every 60 seconds)
- Monitoring tasks (every 5 minutes)
- Cleanup tasks (hourly)
#### Worker Deployment Modes
Onyx supports two deployment modes for background workers, controlled by the `USE_LIGHTWEIGHT_BACKGROUND_WORKER` environment variable:
**Lightweight Mode** (default, `USE_LIGHTWEIGHT_BACKGROUND_WORKER=true`):
- Runs a single consolidated `background` worker that handles all background tasks:
- Light worker tasks (Vespa operations, permissions sync, deletion)
- Document processing (indexing pipeline)
- Document fetching (connector data retrieval)
- Pruning operations (from `heavy` worker)
- Knowledge graph processing (from `kg_processing` worker)
- Monitoring tasks (from `monitoring` worker)
- User file processing (from `user_file_processing` worker)
- Lower resource footprint (fewer worker processes)
- Suitable for smaller deployments or development environments
- Default concurrency: 20 threads (increased to handle combined workload)
**Standard Mode** (`USE_LIGHTWEIGHT_BACKGROUND_WORKER=false`):
- Runs separate specialized workers as documented above (light, docprocessing, docfetching, heavy, kg_processing, monitoring, user_file_processing)
- Better isolation and scalability
- Can scale individual workers independently based on workload
- Suitable for production deployments with higher load
The deployment mode affects:
- **Backend**: Worker processes spawned by supervisord or dev scripts
- **Helm**: Which Kubernetes deployments are created
- **Dev Environment**: Which workers `dev_run_background_jobs.py` spawns
#### Key Features
- **Thread-based Workers**: All workers use thread pools (not processes) for stability
- **Tenant Awareness**: Multi-tenant support with per-tenant task isolation. There is a
middleware layer that automatically finds the appropriate tenant ID when sending tasks
via Celery Beat.
- **Task Prioritization**: High, Medium, Low priority queues
- **Monitoring**: Built-in heartbeat and liveness checking
- **Failure Handling**: Automatic retry and failure recovery mechanisms
- **Redis Coordination**: Inter-process communication via Redis
- **PostgreSQL State**: Task state and metadata stored in PostgreSQL
#### Important Notes
**Defining Tasks**:
- Always use `@shared_task` rather than `@celery_app`
- Put tasks under `background/celery/tasks/` or `ee/background/celery/tasks`
**Defining APIs**:
When creating new FastAPI APIs, do NOT use the `response_model` field. Instead, just annotate the
function's return type.
**Testing Updates**:
If you make any updates to a celery worker and you want to test these changes, you will need
to ask me to restart the celery worker. There is no auto-restart on code-change mechanism.
### Code Quality
```bash
# Install and run pre-commit hooks
pre-commit install
pre-commit run --all-files
```
NOTE: Always make sure everything is strictly typed (both in Python and TypeScript).
## Architecture Overview
### Technology Stack
- **Backend**: Python 3.11, FastAPI, SQLAlchemy, Alembic, Celery
- **Frontend**: Next.js 15+, React 18, TypeScript, Tailwind CSS
- **Database**: PostgreSQL with Redis caching
- **Search**: Vespa vector database
- **Auth**: OAuth2, SAML, multi-provider support
- **AI/ML**: LangChain, LiteLLM, multiple embedding models
### Directory Structure
```
backend/
├── onyx/
│ ├── auth/ # Authentication & authorization
│ ├── chat/ # Chat functionality & LLM interactions
│ ├── connectors/ # Data source connectors
│ ├── db/ # Database models & operations
│ ├── document_index/ # Vespa integration
│ ├── federated_connectors/ # External search connectors
│ ├── llm/ # LLM provider integrations
│ └── server/ # API endpoints & routers
├── ee/ # Enterprise Edition features
├── alembic/ # Database migrations
└── tests/ # Test suites
web/
├── src/app/ # Next.js app router pages
├── src/components/ # Reusable React components
└── src/lib/ # Utilities & business logic
```
## Database & Migrations
### Running Migrations
```bash
# Standard migrations
alembic upgrade head
# Multi-tenant (Enterprise)
alembic -n schema_private upgrade head
```
### Creating Migrations
```bash
# Auto-generate migration
alembic revision --autogenerate -m "description"
# Multi-tenant migration
alembic -n schema_private revision --autogenerate -m "description"
```
## Testing Strategy
There are 4 main types of tests within Onyx:
### Unit Tests
These should not assume any Onyx/external services are available to be called.
Interactions with the outside world should be mocked using `unittest.mock`. Generally, only
write these for complex, isolated modules e.g. `citation_processing.py`.
To run them:
```bash
python -m dotenv -f .vscode/.env run -- pytest -xv backend/tests/unit
```
### External Dependency Unit Tests
These tests assume that all external dependencies of Onyx are available and callable (e.g. Postgres, Redis,
MinIO/S3, and Vespa are running; OpenAI can be called; requests to the internet are fine; etc.).
However, the actual Onyx containers are not running; in these tests we call the function under test directly.
We can also mock components/calls at will.
The goal with these tests is to minimize mocking while giving some flexibility to mock things that are flaky,
need strictly controlled behavior, or need to have their internal behavior validated (e.g. verifying a function is called
with certain args, something that would be impossible with proper integration tests).
A great example of this type of test is `backend/tests/external_dependency_unit/connectors/confluence/test_confluence_group_sync.py`.
To run them:
```bash
python -m dotenv -f .vscode/.env run -- pytest backend/tests/external_dependency_unit
```
### Integration Tests
Standard integration tests. Every test in `backend/tests/integration` runs against a real Onyx deployment. We cannot
mock anything in these tests. Prefer writing integration tests (or External Dependency Unit Tests if mocking/internal
verification is necessary) over any other type of test.
Tests are parallelized at a directory level.
When writing integration tests, make sure to check the root `conftest.py` for useful fixtures + the `backend/tests/integration/common_utils` directory for utilities. Prefer calling the appropriate Manager
class in the utils (if one exists) over directly calling the APIs with a library like `requests`. Prefer using fixtures rather than
calling the utilities directly (e.g. do NOT create admin users with
`admin_user = UserManager.create(name="admin_user")`; instead use the `admin_user` fixture).
A great example of this type of test is `backend/tests/integration/dev_apis/test_simple_chat_api.py`.
To run them:
```bash
python -m dotenv -f .vscode/.env run -- pytest backend/tests/integration
```
### Playwright (E2E) Tests
These tests are an even more complete version of the Integration Tests mentioned above. They run with all Onyx
services, *including* the Web Server.
Use these tests for anything that requires significant frontend <-> backend coordination.
Tests are located at `web/tests/e2e`. Tests are written in TypeScript.
To run them:
```bash
npx playwright test <TEST_NAME>
```
## Logs
When (1) writing integration tests or (2) doing live tests (e.g. curl / playwright), you can access
logs via the `backend/log/<service_name>_debug.log` files. Each Onyx service (api_server, web_server, celery_X)
continuously writes its logs to its own file there.
## Security Considerations
- Never commit API keys or secrets to repository
- Use encrypted credential storage for connector credentials
- Follow RBAC patterns for new features
- Implement proper input validation with Pydantic models
- Use parameterized queries to prevent SQL injection
## AI/LLM Integration
- Multiple LLM providers supported via LiteLLM
- Configurable models per feature (chat, search, embeddings)
- Streaming support for real-time responses
- Token management and rate limiting
- Custom prompts and agent actions
## UI/UX Patterns
- Tailwind CSS with design system in `web/src/components/ui/`
- Radix UI and Headless UI for accessible components
- SWR for data fetching and caching
- Form validation with react-hook-form
- Error handling with popup notifications
## Creating a Plan
When creating a plan in the `plans` directory, make sure to include at least these elements:
**Issues to Address**
What the change is meant to do.
**Important Notes**
Things you come across in your research that are important to the implementation.
**Implementation strategy**
How you are going to make the changes happen. High level approach.
**Tests**
What unit (use rarely), external dependency unit, integration, and playwright tests you plan to write to
verify the correct behavior. Don't overtest. Usually, a given change only needs one type of test.
Do NOT include these: *Timeline*, *Rollback plan*
This is a minimal list - feel free to include more. Do NOT write code as part of your plan.
Keep it high level. You can reference certain files or functions though.
Before writing your plan, make sure to do research. Explore the relevant sections in the codebase.

View File

@@ -1,4 +1,4 @@
<!-- DANSWER_METADATA={"link": "https://github.com/onyx-dot-app/onyx/blob/main/CONTRIBUTING.md"} -->
<!-- ONYX_METADATA={"link": "https://github.com/onyx-dot-app/onyx/blob/main/CONTRIBUTING.md"} -->
# Contributing to Onyx
@@ -12,9 +12,8 @@ As an open source project in a rapidly changing space, we welcome all contributi
The [GitHub Issues](https://github.com/onyx-dot-app/onyx/issues) page is a great place to start for contribution ideas.
To ensure that your contribution is aligned with the project's direction, please reach out to Hagen (or any other maintainer) on the Onyx team
via [Slack](https://join.slack.com/t/onyx-dot-app/shared_invite/zt-2twesxdr6-5iQitKZQpgq~hYIZ~dv3KA) /
[Discord](https://discord.gg/TDJ59cGV2X) or [email](mailto:founders@onyx.app).
To ensure that your contribution is aligned with the project's direction, please reach out to any maintainer on the Onyx team
via [Discord](https://discord.gg/4NA5SbzrWb) or [email](mailto:hello@onyx.app).
Issues that have been explicitly approved by the maintainers (aligned with the direction of the project)
will be marked with the `approved by maintainers` label.
@@ -28,8 +27,7 @@ Your input is vital to making sure that Onyx moves in the right direction.
Before starting on implementation, please raise a GitHub issue.
Also, always feel free to message the founders (Chris Weaver / Yuhong Sun) on
[Slack](https://join.slack.com/t/onyx-dot-app/shared_invite/zt-2twesxdr6-5iQitKZQpgq~hYIZ~dv3KA) /
[Discord](https://discord.gg/TDJ59cGV2X) directly about anything at all.
[Discord](https://discord.gg/4NA5SbzrWb) directly about anything at all.
### Contributing Code
@@ -46,9 +44,7 @@ Our goal is to make contributing as easy as possible. If you run into any issues
That way we can help future contributors and users can avoid the same issue.
We also have support channels and generally interesting discussions on our
[Slack](https://join.slack.com/t/onyx-dot-app/shared_invite/zt-2twesxdr6-5iQitKZQpgq~hYIZ~dv3KA)
and
[Discord](https://discord.gg/TDJ59cGV2X).
[Discord](https://discord.gg/4NA5SbzrWb).
We would love to see you there!
@@ -59,6 +55,7 @@ Onyx being a fully functional app, relies on some external software, specificall
- [Postgres](https://www.postgresql.org/) (Relational DB)
- [Vespa](https://vespa.ai/) (Vector DB/Search Engine)
- [Redis](https://redis.io/) (Cache)
- [MinIO](https://min.io/) (File Store)
- [Nginx](https://nginx.org/) (Not needed for development flows generally)
> **Note:**
@@ -83,10 +80,6 @@ python -m venv .venv
source .venv/bin/activate
```
> **Note:**
> This virtual environment MUST NOT be set up WITHIN the onyx directory if you plan on using mypy within certain IDEs.
> For simplicity, we recommend setting up the virtual environment outside of the onyx directory.
_For Windows, activate the virtual environment using Command Prompt:_
```bash
@@ -102,10 +95,15 @@ If using PowerShell, the command slightly differs:
Install the required python dependencies:
```bash
pip install -r onyx/backend/requirements/default.txt
pip install -r onyx/backend/requirements/dev.txt
pip install -r onyx/backend/requirements/ee.txt
pip install -r onyx/backend/requirements/model_server.txt
pip install -r backend/requirements/default.txt
pip install -r backend/requirements/dev.txt
pip install -r backend/requirements/ee.txt
pip install -r backend/requirements/model_server.txt
```
Fix vscode/cursor auto-imports:
```bash
pip install -e .
```
Install Playwright for Python (headless browser required by the Web Connector)
@@ -120,8 +118,15 @@ You may have to deactivate and reactivate your virtualenv for `playwright` to ap
#### Frontend: Node dependencies
Install [Node.js and npm](https://docs.npmjs.com/downloading-and-installing-node-js-and-npm) for the frontend.
Once the above is done, navigate to `onyx/web` run:
Onyx uses Node v22.20.0. We highly recommend you use [Node Version Manager (nvm)](https://github.com/nvm-sh/nvm)
to manage your Node installations. Once installed, you can run
```bash
nvm install 22 && nvm use 22
node -v # verify your active version
```
Navigate to `onyx/web` and run:
```bash
npm i
@@ -132,8 +137,6 @@ npm i
### Backend
For the backend, you'll need to setup pre-commit hooks (black / reorder-python-imports).
First, install pre-commit (if you don't have it already) following the instructions
[here](https://pre-commit.com/#installation).
With the virtual environment active, install the pre-commit library with:
@@ -153,15 +156,17 @@ To run the mypy checks manually, run `python -m mypy .` from the `onyx/backend`
### Web
We use `prettier` for formatting. The desired version (2.8.8) will be installed via a `npm i` from the `onyx/web` directory.
We use `prettier` for formatting. The desired version will be installed via a `npm i` from the `onyx/web` directory.
To run the formatter, use `npx prettier --write .` from the `onyx/web` directory.
Please double check that prettier passes before creating a pull request.
Pre-commit will also run prettier automatically on files you've recently touched. If re-formatted, your commit will fail.
Re-stage your changes and commit again.
# Running the application for development
## Developing using VSCode Debugger (recommended)
We highly recommend using VSCode debugger for development.
**We highly recommend using VSCode debugger for development.**
See [CONTRIBUTING_VSCODE.md](./CONTRIBUTING_VSCODE.md) for more details.
Otherwise, you can follow the instructions below to run the application for development.
@@ -171,10 +176,10 @@ Otherwise, you can follow the instructions below to run the application for deve
You will need Docker installed to run these containers.
First navigate to `onyx/deployment/docker_compose`, then start up Postgres/Vespa/Redis with:
First navigate to `onyx/deployment/docker_compose`, then start up Postgres/Vespa/Redis/MinIO with:
```bash
docker compose -f docker-compose.dev.yml -p onyx-stack up -d index relational_db cache
docker compose up -d index relational_db cache minio
```
(index refers to Vespa, relational_db refers to Postgres, and cache refers to Redis)
@@ -256,7 +261,7 @@ You can run the full Onyx application stack from pre-built images including all
Navigate to `onyx/deployment/docker_compose` and run:
```bash
docker compose -f docker-compose.dev.yml -p onyx-stack up -d
docker compose up -d
```
After Docker pulls and starts these containers, navigate to `http://localhost:3000` to use Onyx.
@@ -264,7 +269,7 @@ After Docker pulls and starts these containers, navigate to `http://localhost:30
If you want to make changes to Onyx and run those changes in Docker, you can also build a local version of the Onyx container images that incorporates your changes like so:
```bash
docker compose -f docker-compose.dev.yml -p onyx-stack up -d --build
docker compose up -d --build
```

View File

@@ -5,7 +5,7 @@ This guide explains how to set up and use VSCode's debugging capabilities with t
## Initial Setup
1. **Environment Setup**:
- Copy `.vscode/.env.template` to `.vscode/.env`
- Copy `.vscode/env_template.txt` to `.vscode/.env`
- Fill in the necessary environment variables in `.vscode/.env`
2. **launch.json**:
- Copy `.vscode/launch.template.jsonc` to `.vscode/launch.json`
@@ -17,10 +17,12 @@ Before starting, make sure the Docker Daemon is running.
1. Open the Debug view in VSCode (Cmd+Shift+D on macOS)
2. From the dropdown at the top, select "Clear and Restart External Volumes and Containers" and press the green play button
3. From the dropdown at the top, select "Run All Onyx Services" and press the green play button
4. CD into web, run "npm i" followed by npm run dev.
5. Now, you can navigate to onyx in your browser (default is http://localhost:3000) and start using the app
6. You can set breakpoints by clicking to the left of line numbers to help debug while the app is running
7. Use the debug toolbar to step through code, inspect variables, etc.
4. Now, you can navigate to onyx in your browser (default is http://localhost:3000) and start using the app
5. You can set breakpoints by clicking to the left of line numbers to help debug while the app is running
6. Use the debug toolbar to step through code, inspect variables, etc.
Note: Clear and Restart External Volumes and Containers will reset your postgres and Vespa (relational-db and index).
Only run this if you are okay with wiping your data.
## Features

134
README.md
View File

@@ -1,117 +1,103 @@
<!-- DANSWER_METADATA={"link": "https://github.com/onyx-dot-app/onyx/blob/main/README.md"} -->
<a name="readme-top"></a>
<h2 align="center">
<a href="https://www.onyx.app/"> <img width="50%" src="https://github.com/onyx-dot-app/onyx/blob/logo/OnyxLogoCropped.jpg?raw=true)" /></a>
<a href="https://www.onyx.app/"> <img width="50%" src="https://github.com/onyx-dot-app/onyx/blob/logo/OnyxLogoCropped.jpg?raw=true)" /></a>
</h2>
<p align="center">
<p align="center">Open Source Gen-AI + Enterprise Search.</p>
<p align="center">Open Source AI Platform</p>
<p align="center">
<a href="https://docs.onyx.app/" target="_blank">
<img src="https://img.shields.io/badge/docs-view-blue" alt="Documentation">
</a>
<a href="https://join.slack.com/t/onyx-dot-app/shared_invite/zt-2twesxdr6-5iQitKZQpgq~hYIZ~dv3KA" target="_blank">
<img src="https://img.shields.io/badge/slack-join-blue.svg?logo=slack" alt="Slack">
</a>
<a href="https://discord.gg/TDJ59cGV2X" target="_blank">
<img src="https://img.shields.io/badge/discord-join-blue.svg?logo=discord&logoColor=white" alt="Discord">
</a>
<a href="https://github.com/onyx-dot-app/onyx/blob/main/README.md" target="_blank">
<img src="https://img.shields.io/static/v1?label=license&message=MIT&color=blue" alt="License">
</a>
<a href="https://discord.gg/TDJ59cGV2X" target="_blank">
<img src="https://img.shields.io/badge/discord-join-blue.svg?logo=discord&logoColor=white" alt="Discord">
</a>
<a href="https://docs.onyx.app/" target="_blank">
<img src="https://img.shields.io/badge/docs-view-blue" alt="Documentation">
</a>
<a href="https://docs.onyx.app/" target="_blank">
<img src="https://img.shields.io/website?url=https://www.onyx.app&up_message=visit&up_color=blue" alt="Documentation">
</a>
<a href="https://github.com/onyx-dot-app/onyx/blob/main/LICENSE" target="_blank">
<img src="https://img.shields.io/static/v1?label=license&message=MIT&color=blue" alt="License">
</a>
</p>
<strong>[Onyx](https://www.onyx.app/)</strong> (formerly Danswer) is the AI platform connected to your company's docs, apps, and people.
Onyx provides a feature rich Chat interface and plugs into any LLM of your choice.
Keep knowledge and access controls sync-ed across over 40 connectors like Google Drive, Slack, Confluence, Salesforce, etc.
Create custom AI agents with unique prompts, knowledge, and actions that the agents can take.
Onyx can be deployed securely anywhere and for any scale - on a laptop, on-premise, or to cloud.
<h3>Feature Highlights</h3>
**[Onyx](https://www.onyx.app/)** is a feature-rich, self-hostable Chat UI that works with any LLM. It is easy to deploy and can run in a completely airgapped environment.
**Deep research over your team's knowledge:**
Onyx comes loaded with advanced features like Agents, Web Search, RAG, MCP, Deep Research, Connectors to 40+ knowledge sources, and more.
https://private-user-images.githubusercontent.com/32520769/414509312-48392e83-95d0-4fb5-8650-a396e05e0a32.mp4?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk5Mjg2MzYsIm5iZiI6MTczOTkyODMzNiwicGF0aCI6Ii8zMjUyMDc2OS80MTQ1MDkzMTItNDgzOTJlODMtOTVkMC00ZmI1LTg2NTAtYTM5NmUwNWUwYTMyLm1wND9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE5VDAxMjUzNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWFhMzk5Njg2Y2Y5YjFmNDNiYTQ2YzM5ZTg5YWJiYTU2NWMyY2YwNmUyODE2NWUxMDRiMWQxZWJmODI4YTA0MTUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.a9D8A0sgKE9AoaoE-mfFbJ6_OKYeqaf7TZ4Han2JfW8
> [!TIP]
> Run Onyx with one command (or see deployment section below):
> ```
> curl -fsSL https://raw.githubusercontent.com/onyx-dot-app/onyx/main/deployment/docker_compose/install.sh > install.sh && chmod +x install.sh && ./install.sh
> ```
**Use Onyx as a secure AI Chat with any LLM:**
****
![Onyx Chat Silent Demo](https://github.com/onyx-dot-app/onyx/releases/download/v0.21.1/OnyxChatSilentDemo.gif)
**Easily set up connectors to your apps:**
![Onyx Connector Silent Demo](https://github.com/onyx-dot-app/onyx/releases/download/v0.21.1/OnyxConnectorSilentDemo.gif)
## ⭐ Features
- **🤖 Custom Agents:** Build AI Agents with unique instructions, knowledge and actions.
- **🌍 Web Search:** Browse the web with Google PSE, Exa, and Serper as well as an in-house scraper or Firecrawl.
- **🔍 RAG:** Best in class hybrid-search + knowledge graph for uploaded files and ingested documents from connectors.
- **🔄 Connectors:** Pull knowledge, metadata, and access information from over 40 applications.
- **🔬 Deep Research:** Get in depth answers with an agentic multi-step search.
- **▶️ Actions & MCP:** Give AI Agents the ability to interact with external systems.
- **💻 Code Interpreter:** Execute code to analyze data, render graphs and create files.
- **🎨 Image Generation:** Generate images based on user prompts.
- **👥 Collaboration:** Chat sharing, feedback gathering, user management, usage analytics, and more.
Onyx works with all LLMs (like OpenAI, Anthropic, Gemini, etc.) and self-hosted LLMs (like Ollama, vLLM, etc.)
To learn more about the features, check out our [documentation](https://docs.onyx.app/welcome)!
**Access Onyx where your team already works:**
![Onyx Bot Demo](https://github.com/onyx-dot-app/onyx/releases/download/v0.21.1/OnyxBot.png)
## 🚀 Deployment
Onyx supports deployments in Docker, Kubernetes, Terraform, along with guides for major cloud providers.
See guides below:
- [Docker](https://docs.onyx.app/deployment/local/docker) or [Quickstart](https://docs.onyx.app/deployment/getting_started/quickstart) (best for most users)
- [Kubernetes](https://docs.onyx.app/deployment/local/kubernetes) (best for large teams)
- [Terraform](https://docs.onyx.app/deployment/local/terraform) (best for teams already using Terraform)
- Cloud specific guides (best if specifically using [AWS EKS](https://docs.onyx.app/deployment/cloud/aws/eks), [Azure VMs](https://docs.onyx.app/deployment/cloud/azure), etc.)
> [!TIP]
> **To try Onyx for free without deploying, check out [Onyx Cloud](https://cloud.onyx.app/signup)**.
## Deployment
**To try it out for free and get started in seconds, check out [Onyx Cloud](https://cloud.onyx.app/signup)**.
Onyx can also be run locally (even on a laptop) or deployed on a virtual machine with a single
`docker compose` command. Checkout our [docs](https://docs.onyx.app/quickstart) to learn more.
## 🔍 Other Notable Benefits
Onyx is built for teams of all sizes, from individual users to the largest global enterprises.
We also have built-in support for high-availability/scalable deployment on Kubernetes.
References [here](https://github.com/onyx-dot-app/onyx/tree/main/deployment).
- **Enterprise Search**: far more than simple RAG, Onyx has custom indexing and retrieval that remains performant and accurate for scales of up to tens of millions of documents.
- **Security**: SSO (OIDC/SAML/OAuth2), RBAC, encryption of credentials, etc.
- **Management UI**: different user roles such as basic, curator, and admin.
- **Document Permissioning**: mirrors user access from external apps for RAG use cases.
## 🔍 Other Notable Benefits of Onyx
- Custom deep learning models for indexing and inference time, only through Onyx + learning from user feedback.
- Flexible security features like SSO (OIDC/SAML/OAuth2), RBAC, encryption of credentials, etc.
- Knowledge curation features like document-sets, query history, usage analytics, etc.
- Scalable deployment options tested up to many tens of thousands users and hundreds of millions of documents.
## 🚧 Roadmap
- New methods in information retrieval (StructRAG, LightGraphRAG, etc.)
- Personalized Search
- Organizational understanding and ability to locate and suggest experts from your team.
- Code Search
- SQL and Structured Query Language
To see ongoing and upcoming projects, check out our [roadmap](https://github.com/orgs/onyx-dot-app/projects/2)!
## 🔌 Connectors
Keep knowledge and access in sync across 40+ connectors:
- Google Drive
- Confluence
- Slack
- Gmail
- Salesforce
- Microsoft Sharepoint
- Github
- Jira
- Zendesk
- Gong
- Microsoft Teams
- Dropbox
- Local Files
- Websites
- And more ...
See the full list [here](https://docs.onyx.app/connectors).
## 📚 Licensing
There are two editions of Onyx:
- Onyx Community Edition (CE) is available freely under the MIT Expat license. Simply follow the Deployment guide above.
- Onyx Community Edition (CE) is available freely under the MIT license.
- Onyx Enterprise Edition (EE) includes extra features that are primarily useful for larger organizations.
For feature details, check out [our website](https://www.onyx.app/pricing).
To try the Onyx Enterprise Edition:
1. Checkout [Onyx Cloud](https://cloud.onyx.app/signup).
2. For self-hosting the Enterprise Edition, contact us at [founders@onyx.app](mailto:founders@onyx.app) or book a call with us on our [Cal](https://cal.com/team/onyx/founders).
## 👪 Community
Join our open source community on **[Discord](https://discord.gg/TDJ59cGV2X)**!
## 💡 Contributing
Looking to contribute? Please check out the [Contribution Guide](CONTRIBUTING.md) for more details.

4
backend/.gitignore vendored
View File

@@ -9,4 +9,6 @@ api_keys.py
vespa-app.zip
dynamic_config_storage/
celerybeat-schedule*
onyx/connectors/salesforce/data/
onyx/connectors/salesforce/data/
.test.env
/generated

View File

@@ -12,10 +12,11 @@ ARG ONYX_VERSION=0.0.0-dev
# DO_NOT_TRACK is used to disable telemetry for Unstructured
ENV ONYX_VERSION=${ONYX_VERSION} \
DANSWER_RUNNING_IN_DOCKER="true" \
DO_NOT_TRACK="true"
DO_NOT_TRACK="true" \
PLAYWRIGHT_BROWSERS_PATH="/app/.cache/ms-playwright"
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
RUN echo "ONYX_VERSION: ${ONYX_VERSION}"
# Install system dependencies
# cmake needed for psycopg (postgres)
# libpq-dev needed for psycopg (postgres)
@@ -47,22 +48,19 @@ RUN apt-get update && \
# Remove py which is pulled in by retry, py is not needed and is a CVE
COPY ./requirements/default.txt /tmp/requirements.txt
COPY ./requirements/ee.txt /tmp/ee-requirements.txt
RUN pip install --no-cache-dir --upgrade \
--retries 5 \
--timeout 30 \
RUN uv pip install --system --no-cache-dir --upgrade \
-r /tmp/requirements.txt \
-r /tmp/ee-requirements.txt && \
pip uninstall -y py && \
playwright install chromium && \
playwright install-deps chromium && \
ln -s /usr/local/bin/supervisord /usr/bin/supervisord
# Cleanup for CVEs and size reduction
# https://github.com/tornadoweb/tornado/issues/3107
# xserver-common and xvfb included by playwright installation but not needed after
# perl-base is part of the base Python Debian image but not needed for Onyx functionality
# perl-base could only be removed with --allow-remove-essential
RUN apt-get update && \
ln -s /usr/local/bin/supervisord /usr/bin/supervisord && \
# Cleanup for CVEs and size reduction
# https://github.com/tornadoweb/tornado/issues/3107
# xserver-common and xvfb included by playwright installation but not needed after
# perl-base is part of the base Python Debian image but not needed for Onyx functionality
# perl-base could only be removed with --allow-remove-essential
apt-get update && \
apt-get remove -y --allow-remove-essential \
perl-base \
xserver-common \
@@ -72,12 +70,16 @@ RUN apt-get update && \
libxmlsec1-dev \
pkg-config \
gcc && \
apt-get install -y libxmlsec1-openssl && \
# Install here to avoid some packages being cleaned up above
apt-get install -y \
libxmlsec1-openssl \
# Install postgresql-client for easy manual tests
postgresql-client && \
apt-get autoremove -y && \
rm -rf /var/lib/apt/lists/* && \
rm -rf ~/.cache/uv /tmp/*.txt && \
rm -f /usr/local/lib/python3.11/site-packages/tornado/test/test.key
# Pre-downloading models for setups with limited egress
RUN python -c "from tokenizers import Tokenizer; \
Tokenizer.from_pretrained('nomic-ai/nomic-embed-text-v1')"
@@ -85,31 +87,40 @@ Tokenizer.from_pretrained('nomic-ai/nomic-embed-text-v1')"
# Pre-downloading NLTK for setups with limited egress
RUN python -c "import nltk; \
nltk.download('stopwords', quiet=True); \
nltk.download('punkt', quiet=True);"
nltk.download('punkt_tab', quiet=True);"
# nltk.download('wordnet', quiet=True); introduce this back if lemmatization is needed
# Set up application files
WORKDIR /app
# Create non-root user for security best practices
RUN groupadd -g 1001 onyx && \
useradd -u 1001 -g onyx -m -s /bin/bash onyx && \
mkdir -p /var/log/onyx && \
chmod 755 /var/log/onyx && \
chown onyx:onyx /var/log/onyx
# Enterprise Version Files
COPY ./ee /app/ee
COPY --chown=onyx:onyx ./ee /app/ee
COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf
# Set up application files
COPY ./onyx /app/onyx
COPY ./shared_configs /app/shared_configs
COPY ./alembic /app/alembic
COPY ./alembic_tenants /app/alembic_tenants
COPY ./alembic.ini /app/alembic.ini
COPY --chown=onyx:onyx ./onyx /app/onyx
COPY --chown=onyx:onyx ./shared_configs /app/shared_configs
COPY --chown=onyx:onyx ./alembic /app/alembic
COPY --chown=onyx:onyx ./alembic_tenants /app/alembic_tenants
COPY --chown=onyx:onyx ./alembic.ini /app/alembic.ini
COPY supervisord.conf /usr/etc/supervisord.conf
COPY ./static /app/static
COPY --chown=onyx:onyx ./static /app/static
# Escape hatch scripts
COPY ./scripts/debugging /app/scripts/debugging
COPY ./scripts/force_delete_connector_by_id.py /app/scripts/force_delete_connector_by_id.py
COPY --chown=onyx:onyx ./scripts/debugging /app/scripts/debugging
COPY --chown=onyx:onyx ./scripts/force_delete_connector_by_id.py /app/scripts/force_delete_connector_by_id.py
COPY --chown=onyx:onyx ./scripts/supervisord_entrypoint.sh /app/scripts/supervisord_entrypoint.sh
RUN chmod +x /app/scripts/supervisord_entrypoint.sh
# Put logo in assets
COPY ./assets /app/assets
COPY --chown=onyx:onyx ./assets /app/assets
ENV PYTHONPATH=/app

View File

@@ -9,19 +9,42 @@ visit https://github.com/onyx-dot-app/onyx."
# Default ONYX_VERSION, typically overriden during builds by GitHub Actions.
ARG ONYX_VERSION=0.0.0-dev
ENV ONYX_VERSION=${ONYX_VERSION} \
DANSWER_RUNNING_IN_DOCKER="true"
DANSWER_RUNNING_IN_DOCKER="true" \
HF_HOME=/app/.cache/huggingface
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
RUN echo "ONYX_VERSION: ${ONYX_VERSION}"
# Create non-root user for security best practices
RUN mkdir -p /app && \
groupadd -g 1001 onyx && \
useradd -u 1001 -g onyx -m -s /bin/bash onyx && \
chown -R onyx:onyx /app && \
mkdir -p /var/log/onyx && \
chmod 755 /var/log/onyx && \
chown onyx:onyx /var/log/onyx
# --- add toolchain needed for Rust/Python builds (fastuuid) ---
ENV RUSTUP_HOME=/usr/local/rustup \
CARGO_HOME=/usr/local/cargo \
PATH=/usr/local/cargo/bin:$PATH
RUN set -eux; \
apt-get update && apt-get install -y --no-install-recommends \
build-essential \
pkg-config \
curl \
ca-certificates \
# Install latest stable Rust (supports Cargo.lock v4)
&& curl -sSf https://sh.rustup.rs | sh -s -- -y --profile minimal --default-toolchain stable \
&& rustc --version && cargo --version \
&& apt-get remove -y --allow-remove-essential perl-base \
&& apt-get autoremove -y \
&& rm -rf /var/lib/apt/lists/*
COPY ./requirements/model_server.txt /tmp/requirements.txt
RUN pip install --no-cache-dir --upgrade \
--retries 5 \
--timeout 30 \
-r /tmp/requirements.txt
RUN apt-get remove -y --allow-remove-essential perl-base && \
apt-get autoremove -y
RUN uv pip install --system --no-cache-dir --upgrade \
-r /tmp/requirements.txt && \
rm -rf ~/.cache/uv /tmp/*.txt
# Pre-downloading models for setups with limited egress
# Download tokenizers, distilbert for the Onyx model
@@ -36,11 +59,12 @@ snapshot_download(repo_id='onyx-dot-app/information-content-model'); \
snapshot_download('nomic-ai/nomic-embed-text-v1'); \
snapshot_download('mixedbread-ai/mxbai-rerank-xsmall-v1'); \
from sentence_transformers import SentenceTransformer; \
SentenceTransformer(model_name_or_path='nomic-ai/nomic-embed-text-v1', trust_remote_code=True);"
# In case the user has volumes mounted to /root/.cache/huggingface that they've downloaded while
# running Onyx, don't overwrite it with the built in cache folder
RUN mv /root/.cache/huggingface /root/.cache/temp_huggingface
SentenceTransformer(model_name_or_path='nomic-ai/nomic-embed-text-v1', trust_remote_code=True);" && \
# In case the user has volumes mounted to /app/.cache/huggingface that they've downloaded while
# running Onyx, move the current contents of the cache folder to a temporary location to ensure
# it's preserved in order to combine with the user's cache contents
mv /app/.cache/huggingface /app/.cache/temp_huggingface && \
chown -R onyx:onyx /app
WORKDIR /app

View File

@@ -1,4 +1,4 @@
<!-- DANSWER_METADATA={"link": "https://github.com/onyx-dot-app/onyx/blob/main/backend/alembic/README.md"} -->
<!-- ONYX_METADATA={"link": "https://github.com/onyx-dot-app/onyx/blob/main/backend/alembic/README.md"} -->
# Alembic DB Migrations
@@ -20,3 +20,44 @@ To run all un-applied migrations:
To undo migrations:
`alembic downgrade -X`
where X is the number of migrations you want to undo from the current state
### Multi-tenant migrations
For multi-tenant deployments, you can use additional options:
**Upgrade all tenants:**
```bash
alembic -x upgrade_all_tenants=true upgrade head
```
**Upgrade specific schemas:**
```bash
# Single schema
alembic -x schemas=tenant_12345678-1234-1234-1234-123456789012 upgrade head
# Multiple schemas (comma-separated)
alembic -x schemas=tenant_12345678-1234-1234-1234-123456789012,public,another_tenant upgrade head
```
**Upgrade tenants within an alphabetical range:**
```bash
# Upgrade tenants 100-200 when sorted alphabetically (positions 100 to 200)
alembic -x upgrade_all_tenants=true -x tenant_range_start=100 -x tenant_range_end=200 upgrade head
# Upgrade tenants starting from position 1000 alphabetically
alembic -x upgrade_all_tenants=true -x tenant_range_start=1000 upgrade head
# Upgrade first 500 tenants alphabetically
alembic -x upgrade_all_tenants=true -x tenant_range_end=500 upgrade head
```
**Continue on error (for batch operations):**
```bash
alembic -x upgrade_all_tenants=true -x continue=true upgrade head
```
The tenant range filtering works by:
1. Sorting tenant IDs alphabetically
2. Using 1-based position numbers (1st, 2nd, 3rd tenant, etc.)
3. Filtering to the specified range of positions
4. Non-tenant schemas (like 'public') are always included

View File

@@ -1,12 +1,12 @@
from typing import Any, Literal
from onyx.db.engine import get_iam_auth_token
from onyx.db.engine.iam_auth import get_iam_auth_token
from onyx.configs.app_configs import USE_IAM_AUTH
from onyx.configs.app_configs import POSTGRES_HOST
from onyx.configs.app_configs import POSTGRES_PORT
from onyx.configs.app_configs import POSTGRES_USER
from onyx.configs.app_configs import AWS_REGION_NAME
from onyx.db.engine import build_connection_string
from onyx.db.engine import get_all_tenant_ids
from onyx.db.engine.sql_engine import build_connection_string
from onyx.db.engine.tenant_utils import get_all_tenant_ids
from sqlalchemy import event
from sqlalchemy import pool
from sqlalchemy import text
@@ -21,9 +21,14 @@ from alembic import context
from sqlalchemy.ext.asyncio import create_async_engine
from sqlalchemy.sql.schema import SchemaItem
from onyx.configs.constants import SSL_CERT_FILE
from shared_configs.configs import MULTI_TENANT, POSTGRES_DEFAULT_SCHEMA
from shared_configs.configs import (
MULTI_TENANT,
POSTGRES_DEFAULT_SCHEMA,
TENANT_ID_PREFIX,
)
from onyx.db.models import Base
from celery.backends.database.session import ResultModelBase # type: ignore
from onyx.db.engine.sql_engine import SqlEngine
# Make sure in alembic.ini [logger_root] level=INFO is set or most logging will be
# hidden! (defaults to level=WARN)
@@ -68,15 +73,67 @@ def include_object(
return True
def get_schema_options() -> tuple[str, bool, bool, bool]:
def filter_tenants_by_range(
tenant_ids: list[str], start_range: int | None = None, end_range: int | None = None
) -> list[str]:
"""
Filter tenant IDs by alphabetical position range.
Args:
tenant_ids: List of tenant IDs to filter
start_range: Starting position in alphabetically sorted list (1-based, inclusive)
end_range: Ending position in alphabetically sorted list (1-based, inclusive)
Returns:
Filtered list of tenant IDs in their original order
"""
if start_range is None and end_range is None:
return tenant_ids
# Separate tenant IDs from non-tenant schemas
tenant_schemas = [tid for tid in tenant_ids if tid.startswith(TENANT_ID_PREFIX)]
non_tenant_schemas = [
tid for tid in tenant_ids if not tid.startswith(TENANT_ID_PREFIX)
]
# Sort tenant schemas alphabetically.
# NOTE: can cause missed schemas if a schema is created in between workers
# fetching of all tenant IDs. We accept this risk for now. Just re-running
# the migration will fix the issue.
sorted_tenant_schemas = sorted(tenant_schemas)
# Apply range filtering (0-based indexing)
start_idx = start_range if start_range is not None else 0
end_idx = end_range if end_range is not None else len(sorted_tenant_schemas)
# Ensure indices are within bounds
start_idx = max(0, start_idx)
end_idx = min(len(sorted_tenant_schemas), end_idx)
# Get the filtered tenant schemas
filtered_tenant_schemas = sorted_tenant_schemas[start_idx:end_idx]
# Combine with non-tenant schemas and preserve original order
filtered_tenants = []
for tenant_id in tenant_ids:
if tenant_id in filtered_tenant_schemas or tenant_id in non_tenant_schemas:
filtered_tenants.append(tenant_id)
return filtered_tenants
def get_schema_options() -> (
tuple[bool, bool, bool, int | None, int | None, list[str] | None]
):
x_args_raw = context.get_x_argument()
x_args = {}
for arg in x_args_raw:
for pair in arg.split(","):
if "=" in pair:
key, value = pair.split("=", 1)
x_args[key.strip()] = value.strip()
schema_name = x_args.get("schema", POSTGRES_DEFAULT_SCHEMA)
if "=" in arg:
key, value = arg.split("=", 1)
x_args[key.strip()] = value.strip()
else:
raise ValueError(f"Invalid argument: {arg}")
create_schema = x_args.get("create_schema", "true").lower() == "true"
upgrade_all_tenants = x_args.get("upgrade_all_tenants", "false").lower() == "true"
@@ -84,17 +141,81 @@ def get_schema_options() -> tuple[str, bool, bool, bool]:
# only applies to online migrations
continue_on_error = x_args.get("continue", "false").lower() == "true"
if (
MULTI_TENANT
and schema_name == POSTGRES_DEFAULT_SCHEMA
and not upgrade_all_tenants
):
# Tenant range filtering
tenant_range_start = None
tenant_range_end = None
if "tenant_range_start" in x_args:
try:
tenant_range_start = int(x_args["tenant_range_start"])
except ValueError:
raise ValueError(
f"Invalid tenant_range_start value: {x_args['tenant_range_start']}. Must be an integer."
)
if "tenant_range_end" in x_args:
try:
tenant_range_end = int(x_args["tenant_range_end"])
except ValueError:
raise ValueError(
f"Invalid tenant_range_end value: {x_args['tenant_range_end']}. Must be an integer."
)
# Validate range
if tenant_range_start is not None and tenant_range_end is not None:
if tenant_range_start > tenant_range_end:
raise ValueError(
f"tenant_range_start ({tenant_range_start}) cannot be greater than tenant_range_end ({tenant_range_end})"
)
# Specific schema names filtering (replaces both schema_name and the old tenant_ids approach)
schemas = None
if "schemas" in x_args:
schema_names_str = x_args["schemas"].strip()
if schema_names_str:
# Split by comma and strip whitespace
schemas = [
name.strip() for name in schema_names_str.split(",") if name.strip()
]
if schemas:
logger.info(f"Specific schema names specified: {schemas}")
# Validate that only one method is used at a time
range_filtering = tenant_range_start is not None or tenant_range_end is not None
specific_filtering = schemas is not None and len(schemas) > 0
if range_filtering and specific_filtering:
raise ValueError(
"Cannot run default migrations in public schema when multi-tenancy is enabled. "
"Please specify a tenant-specific schema."
"Cannot use both tenant range filtering (tenant_range_start/tenant_range_end) "
"and specific schema filtering (schemas) at the same time. "
"Please use only one filtering method."
)
return schema_name, create_schema, upgrade_all_tenants, continue_on_error
if upgrade_all_tenants and specific_filtering:
raise ValueError(
"Cannot use both upgrade_all_tenants=true and schemas at the same time. "
"Use either upgrade_all_tenants=true for all tenants, or schemas for specific schemas."
)
# If any filtering parameters are specified, we're not doing the default single schema migration
if range_filtering:
upgrade_all_tenants = True
# Validate multi-tenant requirements
if MULTI_TENANT and not upgrade_all_tenants and not specific_filtering:
raise ValueError(
"In multi-tenant mode, you must specify either upgrade_all_tenants=true "
"or provide schemas. Cannot run default migration."
)
return (
create_schema,
upgrade_all_tenants,
continue_on_error,
tenant_range_start,
tenant_range_end,
schemas,
)
def do_run_migrations(
@@ -141,12 +262,20 @@ def provide_iam_token_for_alembic(
async def run_async_migrations() -> None:
(
schema_name,
create_schema,
upgrade_all_tenants,
continue_on_error,
tenant_range_start,
tenant_range_end,
schemas,
) = get_schema_options()
if not schemas and not MULTI_TENANT:
schemas = [POSTGRES_DEFAULT_SCHEMA]
# without init_engine, subsequent engine calls fail hard intentionally
SqlEngine.init_engine(pool_size=20, max_overflow=5)
engine = create_async_engine(
build_connection_string(),
poolclass=pool.NullPool,
@@ -160,12 +289,50 @@ async def run_async_migrations() -> None:
) -> None:
provide_iam_token_for_alembic(dialect, conn_rec, cargs, cparams)
if upgrade_all_tenants:
if schemas:
# Use specific schema names directly without fetching all tenants
logger.info(f"Migrating specific schema names: {schemas}")
i_schema = 0
num_schemas = len(schemas)
for schema in schemas:
i_schema += 1
logger.info(
f"Migrating schema: index={i_schema} num_schemas={num_schemas} schema={schema}"
)
try:
async with engine.connect() as connection:
await connection.run_sync(
do_run_migrations,
schema_name=schema,
create_schema=create_schema,
)
except Exception as e:
logger.error(f"Error migrating schema {schema}: {e}")
if not continue_on_error:
logger.error("--continue=true is not set, raising exception!")
raise
logger.warning("--continue=true is set, continuing to next schema.")
elif upgrade_all_tenants:
tenant_schemas = get_all_tenant_ids()
filtered_tenant_schemas = filter_tenants_by_range(
tenant_schemas, tenant_range_start, tenant_range_end
)
if tenant_range_start is not None or tenant_range_end is not None:
logger.info(
f"Filtering tenants by range: start={tenant_range_start}, end={tenant_range_end}"
)
logger.info(
f"Total tenants: {len(tenant_schemas)}, Filtered tenants: {len(filtered_tenant_schemas)}"
)
i_tenant = 0
num_tenants = len(tenant_schemas)
for schema in tenant_schemas:
num_tenants = len(filtered_tenant_schemas)
for schema in filtered_tenant_schemas:
i_tenant += 1
logger.info(
f"Migrating schema: index={i_tenant} num_tenants={num_tenants} schema={schema}"
@@ -180,36 +347,70 @@ async def run_async_migrations() -> None:
except Exception as e:
logger.error(f"Error migrating schema {schema}: {e}")
if not continue_on_error:
logger.error("--continue is not set, raising exception!")
logger.error("--continue=true is not set, raising exception!")
raise
logger.warning("--continue is set, continuing to next schema.")
logger.warning("--continue=true is set, continuing to next schema.")
else:
# This should not happen in the new design since we require either
# upgrade_all_tenants=true or schemas in multi-tenant mode
# and for non-multi-tenant mode, we should use schemas with the default schema
raise ValueError(
"No migration target specified. Use either upgrade_all_tenants=true for all tenants "
"or schemas for specific schemas."
)
await engine.dispose()
def run_migrations_offline() -> None:
"""This doesn't really get used when we migrate in the cloud."""
"""
NOTE(rkuo): This generates a sql script that can be used to migrate the database ...
instead of migrating the db live via an open connection
Not clear on when this would be used by us or if it even works.
If it is offline, then why are there calls to the db engine?
This doesn't really get used when we migrate in the cloud."""
logger.info("run_migrations_offline starting.")
# without init_engine, subsequent engine calls fail hard intentionally
SqlEngine.init_engine(pool_size=20, max_overflow=5)
(
create_schema,
upgrade_all_tenants,
continue_on_error,
tenant_range_start,
tenant_range_end,
schemas,
) = get_schema_options()
url = build_connection_string()
if schemas:
# Use specific schema names directly without fetching all tenants
logger.info(f"Migrating specific schema names: {schemas}")
for schema in schemas:
logger.info(f"Migrating schema: {schema}")
context.configure(
url=url,
target_metadata=target_metadata, # type: ignore
literal_binds=True,
include_object=include_object,
version_table_schema=schema,
include_schemas=True,
script_location=config.get_main_option("script_location"),
dialect_opts={"paramstyle": "named"},
)
with context.begin_transaction():
context.run_migrations()
elif upgrade_all_tenants:
engine = create_async_engine(url)
if USE_IAM_AUTH:
@@ -223,7 +424,19 @@ def run_migrations_offline() -> None:
tenant_schemas = get_all_tenant_ids()
engine.sync_engine.dispose()
filtered_tenant_schemas = filter_tenants_by_range(
tenant_schemas, tenant_range_start, tenant_range_end
)
if tenant_range_start is not None or tenant_range_end is not None:
logger.info(
f"Filtering tenants by range: start={tenant_range_start}, end={tenant_range_end}"
)
logger.info(
f"Total tenants: {len(tenant_schemas)}, Filtered tenants: {len(filtered_tenant_schemas)}"
)
for schema in filtered_tenant_schemas:
logger.info(f"Migrating schema: {schema}")
context.configure(
url=url,
@@ -239,21 +452,12 @@ def run_migrations_offline() -> None:
with context.begin_transaction():
context.run_migrations()
else:
logger.info(f"Migrating schema: {schema_name}")
context.configure(
url=url,
target_metadata=target_metadata, # type: ignore
literal_binds=True,
include_object=include_object,
version_table_schema=schema_name,
include_schemas=True,
script_location=config.get_main_option("script_location"),
dialect_opts={"paramstyle": "named"},
# This should not happen in the new design
raise ValueError(
"No migration target specified. Use either upgrade_all_tenants=true for all tenants "
"or schemas for specific schemas."
)
with context.begin_transaction():
context.run_migrations()
def run_migrations_online() -> None:
logger.info("run_migrations_online starting.")

View File

@@ -0,0 +1,121 @@
"""rework-kg-config
Revision ID: 03bf8be6b53a
Revises: 65bc6e0f8500
Create Date: 2025-06-16 10:52:34.815335
"""
import json
from datetime import datetime
from datetime import timedelta
from sqlalchemy.dialects import postgresql
from sqlalchemy import text
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "03bf8be6b53a"
down_revision = "65bc6e0f8500"
branch_labels = None
depends_on = None
def upgrade() -> None:
# get current config
current_configs = (
op.get_bind()
.execute(text("SELECT kg_variable_name, kg_variable_values FROM kg_config"))
.all()
)
current_config_dict = {
config.kg_variable_name: (
config.kg_variable_values[0]
if config.kg_variable_name
not in ("KG_VENDOR_DOMAINS", "KG_IGNORE_EMAIL_DOMAINS")
else config.kg_variable_values
)
for config in current_configs
if config.kg_variable_values
}
# not using the KGConfigSettings model here in case it changes in the future
kg_config_settings = json.dumps(
{
"KG_EXPOSED": current_config_dict.get("KG_EXPOSED", False),
"KG_ENABLED": current_config_dict.get("KG_ENABLED", False),
"KG_VENDOR": current_config_dict.get("KG_VENDOR", None),
"KG_VENDOR_DOMAINS": current_config_dict.get("KG_VENDOR_DOMAINS", []),
"KG_IGNORE_EMAIL_DOMAINS": current_config_dict.get(
"KG_IGNORE_EMAIL_DOMAINS", []
),
"KG_COVERAGE_START": current_config_dict.get(
"KG_COVERAGE_START",
(datetime.now() - timedelta(days=90)).strftime("%Y-%m-%d"),
),
"KG_MAX_COVERAGE_DAYS": current_config_dict.get("KG_MAX_COVERAGE_DAYS", 90),
"KG_MAX_PARENT_RECURSION_DEPTH": current_config_dict.get(
"KG_MAX_PARENT_RECURSION_DEPTH", 2
),
"KG_BETA_PERSONA_ID": current_config_dict.get("KG_BETA_PERSONA_ID", None),
}
)
op.execute(
f"INSERT INTO key_value_store (key, value) VALUES ('kg_config', '{kg_config_settings}')"
)
# drop kg config table
op.drop_table("kg_config")
def downgrade() -> None:
# get current config
current_config_dict = {
"KG_EXPOSED": False,
"KG_ENABLED": False,
"KG_VENDOR": [],
"KG_VENDOR_DOMAINS": [],
"KG_IGNORE_EMAIL_DOMAINS": [],
"KG_COVERAGE_START": (datetime.now() - timedelta(days=90)).strftime("%Y-%m-%d"),
"KG_MAX_COVERAGE_DAYS": 90,
"KG_MAX_PARENT_RECURSION_DEPTH": 2,
}
current_configs = (
op.get_bind()
.execute(text("SELECT value FROM key_value_store WHERE key = 'kg_config'"))
.one_or_none()
)
if current_configs is not None:
current_config_dict.update(current_configs[0])
insert_values = [
{
"kg_variable_name": name,
"kg_variable_values": (
[str(val).lower() if isinstance(val, bool) else str(val)]
if not isinstance(val, list)
else val
),
}
for name, val in current_config_dict.items()
]
op.create_table(
"kg_config",
sa.Column("id", sa.Integer(), primary_key=True, nullable=False, index=True),
sa.Column("kg_variable_name", sa.String(), nullable=False, index=True),
sa.Column("kg_variable_values", postgresql.ARRAY(sa.String()), nullable=False),
sa.UniqueConstraint("kg_variable_name", name="uq_kg_config_variable_name"),
)
op.bulk_insert(
sa.table(
"kg_config",
sa.column("kg_variable_name", sa.String),
sa.column("kg_variable_values", postgresql.ARRAY(sa.String)),
),
insert_values,
)
op.execute("DELETE FROM key_value_store WHERE key = 'kg_config'")

View File

@@ -0,0 +1,153 @@
"""add permission sync attempt tables
Revision ID: 03d710ccf29c
Revises: 96a5702df6aa
Create Date: 2025-09-11 13:30:00.000000
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "03d710ccf29c" # Generate a new unique ID
down_revision = "96a5702df6aa"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Create the permission sync status enum
permission_sync_status_enum = sa.Enum(
"not_started",
"in_progress",
"success",
"canceled",
"failed",
"completed_with_errors",
name="permissionsyncstatus",
native_enum=False,
)
# Create doc_permission_sync_attempt table
op.create_table(
"doc_permission_sync_attempt",
sa.Column("id", sa.Integer(), nullable=False),
sa.Column("connector_credential_pair_id", sa.Integer(), nullable=False),
sa.Column("status", permission_sync_status_enum, nullable=False),
sa.Column("total_docs_synced", sa.Integer(), nullable=True),
sa.Column("docs_with_permission_errors", sa.Integer(), nullable=True),
sa.Column("error_message", sa.Text(), nullable=True),
sa.Column(
"time_created",
sa.DateTime(timezone=True),
server_default=sa.text("now()"),
nullable=False,
),
sa.Column("time_started", sa.DateTime(timezone=True), nullable=True),
sa.Column("time_finished", sa.DateTime(timezone=True), nullable=True),
sa.ForeignKeyConstraint(
["connector_credential_pair_id"],
["connector_credential_pair.id"],
),
sa.PrimaryKeyConstraint("id"),
)
# Create indexes for doc_permission_sync_attempt
op.create_index(
"ix_doc_permission_sync_attempt_time_created",
"doc_permission_sync_attempt",
["time_created"],
unique=False,
)
op.create_index(
"ix_permission_sync_attempt_latest_for_cc_pair",
"doc_permission_sync_attempt",
["connector_credential_pair_id", "time_created"],
unique=False,
)
op.create_index(
"ix_permission_sync_attempt_status_time",
"doc_permission_sync_attempt",
["status", sa.text("time_finished DESC")],
unique=False,
)
# Create external_group_permission_sync_attempt table
# connector_credential_pair_id is nullable - group syncs can be global (e.g., Confluence)
op.create_table(
"external_group_permission_sync_attempt",
sa.Column("id", sa.Integer(), nullable=False),
sa.Column("connector_credential_pair_id", sa.Integer(), nullable=True),
sa.Column("status", permission_sync_status_enum, nullable=False),
sa.Column("total_users_processed", sa.Integer(), nullable=True),
sa.Column("total_groups_processed", sa.Integer(), nullable=True),
sa.Column("total_group_memberships_synced", sa.Integer(), nullable=True),
sa.Column("error_message", sa.Text(), nullable=True),
sa.Column(
"time_created",
sa.DateTime(timezone=True),
server_default=sa.text("now()"),
nullable=False,
),
sa.Column("time_started", sa.DateTime(timezone=True), nullable=True),
sa.Column("time_finished", sa.DateTime(timezone=True), nullable=True),
sa.ForeignKeyConstraint(
["connector_credential_pair_id"],
["connector_credential_pair.id"],
),
sa.PrimaryKeyConstraint("id"),
)
# Create indexes for external_group_permission_sync_attempt
op.create_index(
"ix_external_group_permission_sync_attempt_time_created",
"external_group_permission_sync_attempt",
["time_created"],
unique=False,
)
op.create_index(
"ix_group_sync_attempt_cc_pair_time",
"external_group_permission_sync_attempt",
["connector_credential_pair_id", "time_created"],
unique=False,
)
op.create_index(
"ix_group_sync_attempt_status_time",
"external_group_permission_sync_attempt",
["status", sa.text("time_finished DESC")],
unique=False,
)
def downgrade() -> None:
# Drop indexes
op.drop_index(
"ix_group_sync_attempt_status_time",
table_name="external_group_permission_sync_attempt",
)
op.drop_index(
"ix_group_sync_attempt_cc_pair_time",
table_name="external_group_permission_sync_attempt",
)
op.drop_index(
"ix_external_group_permission_sync_attempt_time_created",
table_name="external_group_permission_sync_attempt",
)
op.drop_index(
"ix_permission_sync_attempt_status_time",
table_name="doc_permission_sync_attempt",
)
op.drop_index(
"ix_permission_sync_attempt_latest_for_cc_pair",
table_name="doc_permission_sync_attempt",
)
op.drop_index(
"ix_doc_permission_sync_attempt_time_created",
table_name="doc_permission_sync_attempt",
)
# Drop tables
op.drop_table("external_group_permission_sync_attempt")
op.drop_table("doc_permission_sync_attempt")

View File

@@ -0,0 +1,72 @@
"""add federated connector tables
Revision ID: 0816326d83aa
Revises: 12635f6655b7
Create Date: 2025-06-29 14:09:45.109518
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
# revision identifiers, used by Alembic.
revision = "0816326d83aa"
down_revision = "12635f6655b7"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Create federated_connector table
op.create_table(
"federated_connector",
sa.Column("id", sa.Integer(), nullable=False),
sa.Column("source", sa.String(), nullable=False),
sa.Column("credentials", sa.LargeBinary(), nullable=False),
sa.PrimaryKeyConstraint("id"),
)
# Create federated_connector_oauth_token table
op.create_table(
"federated_connector_oauth_token",
sa.Column("id", sa.Integer(), nullable=False),
sa.Column("federated_connector_id", sa.Integer(), nullable=False),
sa.Column("user_id", postgresql.UUID(as_uuid=True), nullable=False),
sa.Column("token", sa.LargeBinary(), nullable=False),
sa.Column("expires_at", sa.DateTime(), nullable=True),
sa.ForeignKeyConstraint(
["federated_connector_id"], ["federated_connector.id"], ondelete="CASCADE"
),
sa.ForeignKeyConstraint(["user_id"], ["user.id"], ondelete="CASCADE"),
sa.PrimaryKeyConstraint("id"),
)
# Create federated_connector__document_set table
op.create_table(
"federated_connector__document_set",
sa.Column("id", sa.Integer(), nullable=False),
sa.Column("federated_connector_id", sa.Integer(), nullable=False),
sa.Column("document_set_id", sa.Integer(), nullable=False),
sa.Column("entities", postgresql.JSONB(), nullable=False),
sa.ForeignKeyConstraint(
["federated_connector_id"], ["federated_connector.id"], ondelete="CASCADE"
),
sa.ForeignKeyConstraint(
["document_set_id"], ["document_set.id"], ondelete="CASCADE"
),
sa.PrimaryKeyConstraint("id"),
sa.UniqueConstraint(
"federated_connector_id",
"document_set_id",
name="uq_federated_connector_document_set",
),
)
def downgrade() -> None:
# Drop tables in reverse order due to foreign key dependencies
op.drop_table("federated_connector__document_set")
op.drop_table("federated_connector_oauth_token")
op.drop_table("federated_connector")

View File

@@ -0,0 +1,389 @@
"""Migration 2: User file data preparation and backfill
Revision ID: 0cd424f32b1d
Revises: 9b66d3156fc6
Create Date: 2025-09-22 09:44:42.727034
This migration populates the new columns added in migration 1.
It prepares data for the UUID transition and relationship migration.
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy import text
import logging
logger = logging.getLogger("alembic.runtime.migration")
# revision identifiers, used by Alembic.
revision = "0cd424f32b1d"
down_revision = "9b66d3156fc6"
branch_labels = None
depends_on = None
def upgrade() -> None:
"""Populate new columns with data."""
bind = op.get_bind()
inspector = sa.inspect(bind)
# === Step 1: Populate user_file.new_id ===
user_file_columns = [col["name"] for col in inspector.get_columns("user_file")]
has_new_id = "new_id" in user_file_columns
if has_new_id:
logger.info("Populating user_file.new_id with UUIDs...")
# Count rows needing UUIDs
null_count = bind.execute(
text("SELECT COUNT(*) FROM user_file WHERE new_id IS NULL")
).scalar_one()
if null_count > 0:
logger.info(f"Generating UUIDs for {null_count} user_file records...")
# Populate in batches to avoid long locks
batch_size = 10000
total_updated = 0
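# PostgreSQL's UPDATE statement has no LIMIT clause, so each pass narrows the target rows
# via the "id IN (SELECT ... LIMIT :batch_size)" subquery below, and the loop exits once a
# pass touches fewer than batch_size rows.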
while True:
result = bind.execute(
text(
"""
UPDATE user_file
SET new_id = gen_random_uuid()
WHERE new_id IS NULL
AND id IN (
SELECT id FROM user_file
WHERE new_id IS NULL
LIMIT :batch_size
)
"""
),
{"batch_size": batch_size},
)
updated = result.rowcount
total_updated += updated
if updated < batch_size:
break
logger.info(f" Updated {total_updated}/{null_count} records...")
logger.info(f"Generated UUIDs for {total_updated} user_file records")
# Verify all records have UUIDs
remaining_null = bind.execute(
text("SELECT COUNT(*) FROM user_file WHERE new_id IS NULL")
).scalar_one()
if remaining_null > 0:
raise Exception(
f"Failed to populate all user_file.new_id values ({remaining_null} NULL)"
)
# Lock down the column
op.alter_column("user_file", "new_id", nullable=False)
op.alter_column("user_file", "new_id", server_default=None)
logger.info("Locked down user_file.new_id column")
# === Step 2: Populate persona__user_file.user_file_id_uuid ===
persona_user_file_columns = [
col["name"] for col in inspector.get_columns("persona__user_file")
]
if has_new_id and "user_file_id_uuid" in persona_user_file_columns:
logger.info("Populating persona__user_file.user_file_id_uuid...")
# Count rows needing update
null_count = bind.execute(
text(
"""
SELECT COUNT(*) FROM persona__user_file
WHERE user_file_id IS NOT NULL AND user_file_id_uuid IS NULL
"""
)
).scalar_one()
if null_count > 0:
logger.info(f"Updating {null_count} persona__user_file records...")
# Update in batches
batch_size = 10000
total_updated = 0
while True:
result = bind.execute(
text(
"""
UPDATE persona__user_file p
SET user_file_id_uuid = uf.new_id
FROM user_file uf
WHERE p.user_file_id = uf.id
AND p.user_file_id_uuid IS NULL
AND p.persona_id IN (
SELECT persona_id
FROM persona__user_file
WHERE user_file_id_uuid IS NULL
LIMIT :batch_size
)
"""
),
{"batch_size": batch_size},
)
updated = result.rowcount
total_updated += updated
if updated < batch_size:
break
logger.info(f" Updated {total_updated}/{null_count} records...")
logger.info(f"Updated {total_updated} persona__user_file records")
# Verify all records are populated
remaining_null = bind.execute(
text(
"""
SELECT COUNT(*) FROM persona__user_file
WHERE user_file_id IS NOT NULL AND user_file_id_uuid IS NULL
"""
)
).scalar_one()
if remaining_null > 0:
raise Exception(
f"Failed to populate all persona__user_file.user_file_id_uuid values ({remaining_null} NULL)"
)
op.alter_column("persona__user_file", "user_file_id_uuid", nullable=False)
logger.info("Locked down persona__user_file.user_file_id_uuid column")
# === Step 3: Create user_project records from chat_folder ===
if "chat_folder" in inspector.get_table_names():
logger.info("Creating user_project records from chat_folder...")
result = bind.execute(
text(
"""
INSERT INTO user_project (user_id, name)
SELECT cf.user_id, cf.name
FROM chat_folder cf
WHERE NOT EXISTS (
SELECT 1
FROM user_project up
WHERE up.user_id = cf.user_id AND up.name = cf.name
)
"""
)
)
logger.info(f"Created {result.rowcount} user_project records from chat_folder")
# === Step 4: Populate chat_session.project_id ===
chat_session_columns = [
col["name"] for col in inspector.get_columns("chat_session")
]
if "folder_id" in chat_session_columns and "project_id" in chat_session_columns:
logger.info("Populating chat_session.project_id...")
# Count sessions needing update
null_count = bind.execute(
text(
"""
SELECT COUNT(*) FROM chat_session
WHERE project_id IS NULL AND folder_id IS NOT NULL
"""
)
).scalar_one()
if null_count > 0:
logger.info(f"Updating {null_count} chat_session records...")
result = bind.execute(
text(
"""
UPDATE chat_session cs
SET project_id = up.id
FROM chat_folder cf
JOIN user_project up ON up.user_id = cf.user_id AND up.name = cf.name
WHERE cs.folder_id = cf.id AND cs.project_id IS NULL
"""
)
)
logger.info(f"Updated {result.rowcount} chat_session records")
# Verify all records are populated
remaining_null = bind.execute(
text(
"""
SELECT COUNT(*) FROM chat_session
WHERE project_id IS NULL AND folder_id IS NOT NULL
"""
)
).scalar_one()
if remaining_null > 0:
logger.warning(
f"Warning: {remaining_null} chat_session records could not be mapped to projects"
)
# === Step 5: Update plaintext FileRecord IDs/display names to UUID scheme ===
# Prior to UUID migration, plaintext cache files were stored with file_id like 'plain_text_<int_id>'.
# After migration, we use 'plaintext_<uuid>' (note the name change to 'plaintext_').
# This step remaps existing FileRecord rows to the new naming while preserving object_key/bucket.
logger.info("Updating plaintext FileRecord ids and display names to UUID scheme...")
# Count legacy plaintext records that can be mapped to UUID user_file ids
count_query = text(
"""
SELECT COUNT(*)
FROM file_record fr
JOIN user_file uf ON fr.file_id = CONCAT('plaintext_', uf.id::text)
WHERE LOWER(fr.file_origin::text) = 'plaintext_cache'
"""
)
legacy_count = bind.execute(count_query).scalar_one()
if legacy_count and legacy_count > 0:
logger.info(f"Found {legacy_count} legacy plaintext file records to update")
# Update display_name first for readability (safe regardless of rename)
bind.execute(
text(
"""
UPDATE file_record fr
SET display_name = CONCAT('Plaintext for user file ', uf.new_id::text)
FROM user_file uf
WHERE LOWER(fr.file_origin::text) = 'plaintext_cache'
AND fr.file_id = CONCAT('plaintext_', uf.id::text)
"""
)
)
# Remap file_id from 'plaintext_<int>' -> 'plaintext_<uuid>' using transitional new_id
# Use a single UPDATE ... WHERE file_id LIKE 'plain_text_%'
# and ensure it aligns to existing user_file ids to avoid renaming unrelated rows
result = bind.execute(
text(
"""
UPDATE file_record fr
SET file_id = CONCAT('plaintext_', uf.new_id::text)
FROM user_file uf
WHERE LOWER(fr.file_origin::text) = 'plaintext_cache'
AND fr.file_id = CONCAT('plaintext_', uf.id::text)
"""
)
)
logger.info(
f"Updated {result.rowcount} plaintext file_record ids to UUID scheme"
)
# === Step 6: Ensure document_id_migrated default TRUE and backfill existing FALSE ===
# New records should default to migrated=True so the migration task won't run for them.
# Existing rows that had a legacy document_id should be marked as not migrated to be processed.
# Backfill existing records: if document_id is not null, set to FALSE
bind.execute(
text(
"""
UPDATE user_file
SET document_id_migrated = FALSE
WHERE document_id IS NOT NULL
"""
)
)
# === Step 7: Backfill user_file.status from index_attempt ===
logger.info("Backfilling user_file.status from index_attempt...")
# Update user_file status based on latest index attempt
# Using CTEs instead of temp tables for asyncpg compatibility
result = bind.execute(
text(
"""
WITH latest_attempt AS (
SELECT DISTINCT ON (ia.connector_credential_pair_id)
ia.connector_credential_pair_id,
ia.status
FROM index_attempt ia
ORDER BY ia.connector_credential_pair_id, ia.time_updated DESC
),
uf_to_ccp AS (
SELECT DISTINCT uf.id AS uf_id, ccp.id AS cc_pair_id
FROM user_file uf
JOIN document_by_connector_credential_pair dcc
ON dcc.id = REPLACE(uf.document_id, 'USER_FILE_CONNECTOR__', 'FILE_CONNECTOR__')
JOIN connector_credential_pair ccp
ON ccp.connector_id = dcc.connector_id
AND ccp.credential_id = dcc.credential_id
)
UPDATE user_file uf
SET status = CASE
WHEN la.status IN ('NOT_STARTED', 'IN_PROGRESS') THEN 'PROCESSING'
WHEN la.status = 'SUCCESS' THEN 'COMPLETED'
ELSE 'FAILED'
END
FROM uf_to_ccp ufc
LEFT JOIN latest_attempt la
ON la.connector_credential_pair_id = ufc.cc_pair_id
WHERE uf.id = ufc.uf_id
AND uf.status = 'PROCESSING'
"""
)
)
logger.info(f"Updated status for {result.rowcount} user_file records")
logger.info("Migration 2 (data preparation) completed successfully")
def downgrade() -> None:
"""Reset populated data to allow clean downgrade of schema."""
bind = op.get_bind()
inspector = sa.inspect(bind)
logger.info("Starting downgrade of data preparation...")
# Reset user_file columns to allow nulls before data removal
if "user_file" in inspector.get_table_names():
columns = [col["name"] for col in inspector.get_columns("user_file")]
if "new_id" in columns:
op.alter_column(
"user_file",
"new_id",
nullable=True,
server_default=sa.text("gen_random_uuid()"),
)
# Optionally clear the data
# bind.execute(text("UPDATE user_file SET new_id = NULL"))
logger.info("Reset user_file.new_id to nullable")
# Reset persona__user_file.user_file_id_uuid
if "persona__user_file" in inspector.get_table_names():
columns = [col["name"] for col in inspector.get_columns("persona__user_file")]
if "user_file_id_uuid" in columns:
op.alter_column("persona__user_file", "user_file_id_uuid", nullable=True)
# Optionally clear the data
# bind.execute(text("UPDATE persona__user_file SET user_file_id_uuid = NULL"))
logger.info("Reset persona__user_file.user_file_id_uuid to nullable")
# Note: We don't delete user_project records or reset chat_session.project_id
# as these might be in use and can be handled by the schema downgrade
# Reset user_file.status to default
if "user_file" in inspector.get_table_names():
columns = [col["name"] for col in inspector.get_columns("user_file")]
if "status" in columns:
bind.execute(text("UPDATE user_file SET status = 'PROCESSING'"))
logger.info("Reset user_file.status to default")
logger.info("Downgrade completed successfully")

View File

@@ -0,0 +1,596 @@
"""drive-canonical-ids
Revision ID: 12635f6655b7
Revises: 58c50ef19f08
Create Date: 2025-06-20 14:44:54.241159
"""
from alembic import op
import sqlalchemy as sa
from urllib.parse import urlparse, urlunparse
from httpx import HTTPStatusError
import httpx
from onyx.document_index.factory import get_default_document_index
from onyx.db.search_settings import SearchSettings
from onyx.document_index.vespa.shared_utils.utils import get_vespa_http_client
from onyx.document_index.vespa.shared_utils.utils import (
replace_invalid_doc_id_characters,
)
from onyx.document_index.vespa_constants import DOCUMENT_ID_ENDPOINT
from onyx.utils.logger import setup_logger
import os
logger = setup_logger()
# revision identifiers, used by Alembic.
revision = "12635f6655b7"
down_revision = "58c50ef19f08"
branch_labels = None
depends_on = None
SKIP_CANON_DRIVE_IDS = os.environ.get("SKIP_CANON_DRIVE_IDS", "true").lower() == "true"
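# NOTE: this defaults to "true", so the (potentially long-running) canonicalization below is
# skipped unless SKIP_CANON_DRIVE_IDS=false is set in the environment when migrating.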
def active_search_settings() -> tuple[SearchSettings, SearchSettings | None]:
result = op.get_bind().execute(
sa.text(
"""
SELECT * FROM search_settings WHERE status = 'PRESENT' ORDER BY id DESC LIMIT 1
"""
)
)
search_settings_fetch = result.fetchall()
search_settings = (
SearchSettings(**search_settings_fetch[0]._asdict())
if search_settings_fetch
else None
)
result2 = op.get_bind().execute(
sa.text(
"""
SELECT * FROM search_settings WHERE status = 'FUTURE' ORDER BY id DESC LIMIT 1
"""
)
)
search_settings_future_fetch = result2.fetchall()
search_settings_future = (
SearchSettings(**search_settings_future_fetch[0]._asdict())
if search_settings_future_fetch
else None
)
if not isinstance(search_settings, SearchSettings):
raise RuntimeError(
"current search settings is of type " + str(type(search_settings))
)
if (
not isinstance(search_settings_future, SearchSettings)
and search_settings_future is not None
):
raise RuntimeError(
"future search settings is of type " + str(type(search_settings_future))
)
return search_settings, search_settings_future
def normalize_google_drive_url(url: str) -> str:
"""Remove query parameters from Google Drive URLs to create canonical document IDs.
NOTE: copied from drive doc_conversion.py
"""
parsed_url = urlparse(url)
parsed_url = parsed_url._replace(query="")
spl_path = parsed_url.path.split("/")
if spl_path and (spl_path[-1] in ["edit", "view", "preview"]):
spl_path.pop()
parsed_url = parsed_url._replace(path="/".join(spl_path))
# Remove query parameters and reconstruct URL
return urlunparse(parsed_url)
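# Illustrative example (hypothetical URL, not taken from any real document):
#   normalize_google_drive_url("https://docs.google.com/document/d/abc123/edit?usp=sharing")
#   -> "https://docs.google.com/document/d/abc123"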
def get_google_drive_documents_from_database() -> list[dict]:
"""Get all Google Drive documents from the database."""
bind = op.get_bind()
result = bind.execute(
sa.text(
"""
SELECT d.id
FROM document d
JOIN document_by_connector_credential_pair dcc ON d.id = dcc.id
JOIN connector_credential_pair cc ON dcc.connector_id = cc.connector_id
AND dcc.credential_id = cc.credential_id
JOIN connector c ON cc.connector_id = c.id
WHERE c.source = 'GOOGLE_DRIVE'
"""
)
)
documents = []
for row in result:
documents.append({"document_id": row.id})
return documents
def update_document_id_in_database(
old_doc_id: str, new_doc_id: str, index_name: str
) -> None:
"""Update document IDs in all relevant database tables using copy-and-swap approach."""
bind = op.get_bind()
# print(f"Updating database tables for document {old_doc_id} -> {new_doc_id}")
# Check if new document ID already exists
result = bind.execute(
sa.text("SELECT COUNT(*) FROM document WHERE id = :new_id"),
{"new_id": new_doc_id},
)
row = result.fetchone()
if row and row[0] > 0:
# print(f"Document with ID {new_doc_id} already exists, deleting old one")
delete_document_from_db(old_doc_id, index_name)
return
# Step 1: Create a new document row with the new ID (copy all fields from old row)
# Use a conservative approach to handle columns that might not exist in all installations
try:
bind.execute(
sa.text(
"""
INSERT INTO document (id, from_ingestion_api, boost, hidden, semantic_id,
link, doc_updated_at, primary_owners, secondary_owners,
external_user_emails, external_user_group_ids, is_public,
chunk_count, last_modified, last_synced, kg_stage, kg_processing_time)
SELECT :new_id, from_ingestion_api, boost, hidden, semantic_id,
link, doc_updated_at, primary_owners, secondary_owners,
external_user_emails, external_user_group_ids, is_public,
chunk_count, last_modified, last_synced, kg_stage, kg_processing_time
FROM document
WHERE id = :old_id
"""
),
{"new_id": new_doc_id, "old_id": old_doc_id},
)
# print(f"Successfully updated database tables for document {old_doc_id} -> {new_doc_id}")
except Exception as e:
# If the full INSERT fails, try a more basic version with only core columns
logger.warning(f"Full INSERT failed, trying basic version: {e}")
bind.execute(
sa.text(
"""
INSERT INTO document (id, from_ingestion_api, boost, hidden, semantic_id,
link, doc_updated_at, primary_owners, secondary_owners)
SELECT :new_id, from_ingestion_api, boost, hidden, semantic_id,
link, doc_updated_at, primary_owners, secondary_owners
FROM document
WHERE id = :old_id
"""
),
{"new_id": new_doc_id, "old_id": old_doc_id},
)
# Step 2: Update all foreign key references to point to the new ID
# Update document_by_connector_credential_pair table
bind.execute(
sa.text(
"UPDATE document_by_connector_credential_pair SET id = :new_id WHERE id = :old_id"
),
{"new_id": new_doc_id, "old_id": old_doc_id},
)
# print(f"Successfully updated document_by_connector_credential_pair table for document {old_doc_id} -> {new_doc_id}")
# Update search_doc table (stores search results for chat replay)
# This is critical for agent functionality
bind.execute(
sa.text(
"UPDATE search_doc SET document_id = :new_id WHERE document_id = :old_id"
),
{"new_id": new_doc_id, "old_id": old_doc_id},
)
# print(f"Successfully updated search_doc table for document {old_doc_id} -> {new_doc_id}")
# Update document_retrieval_feedback table (user feedback on documents)
bind.execute(
sa.text(
"UPDATE document_retrieval_feedback SET document_id = :new_id WHERE document_id = :old_id"
),
{"new_id": new_doc_id, "old_id": old_doc_id},
)
# print(f"Successfully updated document_retrieval_feedback table for document {old_doc_id} -> {new_doc_id}")
# Update document__tag table (document-tag relationships)
bind.execute(
sa.text(
"UPDATE document__tag SET document_id = :new_id WHERE document_id = :old_id"
),
{"new_id": new_doc_id, "old_id": old_doc_id},
)
# print(f"Successfully updated document__tag table for document {old_doc_id} -> {new_doc_id}")
# Update user_file table (user uploaded files linked to documents)
bind.execute(
sa.text(
"UPDATE user_file SET document_id = :new_id WHERE document_id = :old_id"
),
{"new_id": new_doc_id, "old_id": old_doc_id},
)
# print(f"Successfully updated user_file table for document {old_doc_id} -> {new_doc_id}")
# Update KG and chunk_stats tables (these may not exist in all installations)
try:
# Update kg_entity table
bind.execute(
sa.text(
"UPDATE kg_entity SET document_id = :new_id WHERE document_id = :old_id"
),
{"new_id": new_doc_id, "old_id": old_doc_id},
)
# print(f"Successfully updated kg_entity table for document {old_doc_id} -> {new_doc_id}")
# Update kg_entity_extraction_staging table
bind.execute(
sa.text(
"UPDATE kg_entity_extraction_staging SET document_id = :new_id WHERE document_id = :old_id"
),
{"new_id": new_doc_id, "old_id": old_doc_id},
)
# print(f"Successfully updated kg_entity_extraction_staging table for document {old_doc_id} -> {new_doc_id}")
# Update kg_relationship table
bind.execute(
sa.text(
"UPDATE kg_relationship SET source_document = :new_id WHERE source_document = :old_id"
),
{"new_id": new_doc_id, "old_id": old_doc_id},
)
# print(f"Successfully updated kg_relationship table for document {old_doc_id} -> {new_doc_id}")
# Update kg_relationship_extraction_staging table
bind.execute(
sa.text(
"UPDATE kg_relationship_extraction_staging SET source_document = :new_id WHERE source_document = :old_id"
),
{"new_id": new_doc_id, "old_id": old_doc_id},
)
# print(f"Successfully updated kg_relationship_extraction_staging table for document {old_doc_id} -> {new_doc_id}")
# Update chunk_stats table
bind.execute(
sa.text(
"UPDATE chunk_stats SET document_id = :new_id WHERE document_id = :old_id"
),
{"new_id": new_doc_id, "old_id": old_doc_id},
)
# print(f"Successfully updated chunk_stats table for document {old_doc_id} -> {new_doc_id}")
# Update chunk_stats ID field which includes document_id
bind.execute(
sa.text(
"""
UPDATE chunk_stats
SET id = REPLACE(id, :old_id, :new_id)
WHERE id LIKE :old_id_pattern
"""
),
{
"new_id": new_doc_id,
"old_id": old_doc_id,
"old_id_pattern": f"{old_doc_id}__%",
},
)
# print(f"Successfully updated chunk_stats ID field for document {old_doc_id} -> {new_doc_id}")
except Exception as e:
logger.warning(f"Some KG/chunk tables may not exist or failed to update: {e}")
# Step 3: Delete the old document row (this should now be safe since all FKs point to new row)
bind.execute(
sa.text("DELETE FROM document WHERE id = :old_id"), {"old_id": old_doc_id}
)
# print(f"Successfully deleted document {old_doc_id} from database")
def _visit_chunks(
*,
http_client: httpx.Client,
index_name: str,
selection: str,
continuation: str | None = None,
) -> tuple[list[dict], str | None]:
"""Helper that calls the /document/v1 visit API once and returns (docs, next_token)."""
# Use the same URL as the document API, but with visit-specific params
base_url = DOCUMENT_ID_ENDPOINT.format(index_name=index_name)
params: dict[str, str] = {
"selection": selection,
"wantedDocumentCount": "1000",
}
if continuation:
params["continuation"] = continuation
# print(f"Visiting chunks for selection '{selection}' with params {params}")
resp = http_client.get(base_url, params=params, timeout=None)
# print(f"Visited chunks for document {selection}")
resp.raise_for_status()
payload = resp.json()
return payload.get("documents", []), payload.get("continuation")
def delete_document_chunks_from_vespa(index_name: str, doc_id: str) -> None:
"""Delete all chunks for *doc_id* from Vespa using continuation-token paging (no offset)."""
total_deleted = 0
# Use exact match instead of contains - Document Selector Language doesn't support contains
selection = f'{index_name}.document_id=="{doc_id}"'
with get_vespa_http_client() as http_client:
continuation: str | None = None
while True:
docs, continuation = _visit_chunks(
http_client=http_client,
index_name=index_name,
selection=selection,
continuation=continuation,
)
if not docs:
break
for doc in docs:
vespa_full_id = doc.get("id")
if not vespa_full_id:
continue
vespa_doc_uuid = vespa_full_id.split("::")[-1]
delete_url = f"{DOCUMENT_ID_ENDPOINT.format(index_name=index_name)}/{vespa_doc_uuid}"
try:
resp = http_client.delete(delete_url)
resp.raise_for_status()
total_deleted += 1
except Exception as e:
print(f"Failed to delete chunk {vespa_doc_uuid}: {e}")
if not continuation:
break
def update_document_id_in_vespa(
index_name: str, old_doc_id: str, new_doc_id: str
) -> None:
"""Update all chunks' document_id field from *old_doc_id* to *new_doc_id* using continuation paging."""
clean_new_doc_id = replace_invalid_doc_id_characters(new_doc_id)
# Use exact match instead of contains - Document Selector Language doesn't support contains
selection = f'{index_name}.document_id=="{old_doc_id}"'
with get_vespa_http_client() as http_client:
continuation: str | None = None
while True:
# print(f"Visiting chunks for document {old_doc_id} -> {new_doc_id}")
docs, continuation = _visit_chunks(
http_client=http_client,
index_name=index_name,
selection=selection,
continuation=continuation,
)
if not docs:
break
for doc in docs:
vespa_full_id = doc.get("id")
if not vespa_full_id:
continue
vespa_doc_uuid = vespa_full_id.split("::")[-1]
vespa_url = f"{DOCUMENT_ID_ENDPOINT.format(index_name=index_name)}/{vespa_doc_uuid}"
update_request = {
"fields": {"document_id": {"assign": clean_new_doc_id}}
}
try:
resp = http_client.put(vespa_url, json=update_request)
resp.raise_for_status()
except Exception as e:
print(f"Failed to update chunk {vespa_doc_uuid}: {e}")
raise
if not continuation:
break
def delete_document_from_db(current_doc_id: str, index_name: str) -> None:
# Delete all foreign key references first, then delete the document
try:
bind = op.get_bind()
# Delete from agent-related tables first (order matters due to foreign keys)
# Delete from agent__sub_query__search_doc first since it references search_doc
bind.execute(
sa.text(
"""
DELETE FROM agent__sub_query__search_doc
WHERE search_doc_id IN (
SELECT id FROM search_doc WHERE document_id = :doc_id
)
"""
),
{"doc_id": current_doc_id},
)
# Delete from chat_message__search_doc
bind.execute(
sa.text(
"""
DELETE FROM chat_message__search_doc
WHERE search_doc_id IN (
SELECT id FROM search_doc WHERE document_id = :doc_id
)
"""
),
{"doc_id": current_doc_id},
)
# Now we can safely delete from search_doc
bind.execute(
sa.text("DELETE FROM search_doc WHERE document_id = :doc_id"),
{"doc_id": current_doc_id},
)
# Delete from document_by_connector_credential_pair
bind.execute(
sa.text(
"DELETE FROM document_by_connector_credential_pair WHERE id = :doc_id"
),
{"doc_id": current_doc_id},
)
# Delete from other tables that reference this document
bind.execute(
sa.text(
"DELETE FROM document_retrieval_feedback WHERE document_id = :doc_id"
),
{"doc_id": current_doc_id},
)
bind.execute(
sa.text("DELETE FROM document__tag WHERE document_id = :doc_id"),
{"doc_id": current_doc_id},
)
bind.execute(
sa.text("DELETE FROM user_file WHERE document_id = :doc_id"),
{"doc_id": current_doc_id},
)
# Delete from KG tables if they exist
try:
bind.execute(
sa.text("DELETE FROM kg_entity WHERE document_id = :doc_id"),
{"doc_id": current_doc_id},
)
bind.execute(
sa.text(
"DELETE FROM kg_entity_extraction_staging WHERE document_id = :doc_id"
),
{"doc_id": current_doc_id},
)
bind.execute(
sa.text("DELETE FROM kg_relationship WHERE source_document = :doc_id"),
{"doc_id": current_doc_id},
)
bind.execute(
sa.text(
"DELETE FROM kg_relationship_extraction_staging WHERE source_document = :doc_id"
),
{"doc_id": current_doc_id},
)
bind.execute(
sa.text("DELETE FROM chunk_stats WHERE document_id = :doc_id"),
{"doc_id": current_doc_id},
)
bind.execute(
sa.text("DELETE FROM chunk_stats WHERE id LIKE :doc_id_pattern"),
{"doc_id_pattern": f"{current_doc_id}__%"},
)
except Exception as e:
logger.warning(
f"Some KG/chunk tables may not exist or failed to delete from: {e}"
)
# Finally delete the document itself
bind.execute(
sa.text("DELETE FROM document WHERE id = :doc_id"),
{"doc_id": current_doc_id},
)
# Delete chunks from vespa
delete_document_chunks_from_vespa(index_name, current_doc_id)
except Exception as e:
print(f"Failed to delete duplicate document {current_doc_id}: {e}")
# Continue with other documents instead of failing the entire migration
def upgrade() -> None:
if SKIP_CANON_DRIVE_IDS:
return
current_search_settings, future_search_settings = active_search_settings()
document_index = get_default_document_index(
current_search_settings,
future_search_settings,
)
# Get the index name
if hasattr(document_index, "index_name"):
index_name = document_index.index_name
else:
# Default index name if we can't get it from the document_index
index_name = "danswer_index"
# Get all Google Drive documents from the database (this is faster and more reliable)
gdrive_documents = get_google_drive_documents_from_database()
if not gdrive_documents:
return
# Track normalized document IDs to detect duplicates
all_normalized_doc_ids = set()
updated_count = 0
for doc_info in gdrive_documents:
current_doc_id = doc_info["document_id"]
normalized_doc_id = normalize_google_drive_url(current_doc_id)
print(f"Processing document {current_doc_id} -> {normalized_doc_id}")
# Check for duplicates
if normalized_doc_id in all_normalized_doc_ids:
# print(f"Deleting duplicate document {current_doc_id}")
delete_document_from_db(current_doc_id, index_name)
continue
all_normalized_doc_ids.add(normalized_doc_id)
# If the document ID already doesn't have query parameters, skip it
if current_doc_id == normalized_doc_id:
# print(f"Skipping document {current_doc_id} -> {normalized_doc_id} because it already has no query parameters")
continue
try:
# Update both database and Vespa in order
# Database first to ensure consistency
update_document_id_in_database(
current_doc_id, normalized_doc_id, index_name
)
# For Vespa, we can now use the original document IDs since we're using contains matching
update_document_id_in_vespa(index_name, current_doc_id, normalized_doc_id)
updated_count += 1
# print(f"Finished updating document {current_doc_id} -> {normalized_doc_id}")
except Exception as e:
print(f"Failed to update document {current_doc_id}: {e}")
if isinstance(e, HTTPStatusError):
print(f"HTTPStatusError: {e}")
print(f"Response: {e.response.text}")
print(f"Status: {e.response.status_code}")
print(f"Headers: {e.response.headers}")
print(f"Request: {e.request.url}")
print(f"Request headers: {e.request.headers}")
# Note: Rollback is complex with copy-and-swap approach since the old document is already deleted
# In case of failure, manual intervention may be required
# Continue with other documents instead of failing the entire migration
continue
logger.info(f"Migration complete. Updated {updated_count} Google Drive documents")
def downgrade() -> None:
# this is a one way migration, so no downgrade.
# It wouldn't make sense to store the extra query parameters
# and duplicate documents to allow a reversal.
pass

View File

@@ -0,0 +1,261 @@
"""Migration 3: User file relationship migration
Revision ID: 16c37a30adf2
Revises: 0cd424f32b1d
Create Date: 2025-09-22 09:47:34.175596
This migration converts folder-based relationships to project-based relationships.
It migrates persona__user_folder to persona__user_file and populates project__user_file.
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy import text
import logging
logger = logging.getLogger("alembic.runtime.migration")
# revision identifiers, used by Alembic.
revision = "16c37a30adf2"
down_revision = "0cd424f32b1d"
branch_labels = None
depends_on = None
def upgrade() -> None:
"""Migrate folder-based relationships to project-based relationships."""
bind = op.get_bind()
inspector = sa.inspect(bind)
# === Step 1: Migrate persona__user_folder to persona__user_file ===
table_names = inspector.get_table_names()
if "persona__user_folder" in table_names and "user_file" in table_names:
user_file_columns = [col["name"] for col in inspector.get_columns("user_file")]
has_new_id = "new_id" in user_file_columns
if has_new_id and "folder_id" in user_file_columns:
logger.info(
"Migrating persona__user_folder relationships to persona__user_file..."
)
# Count relationships to migrate (asyncpg-compatible)
count_query = text(
"""
SELECT COUNT(*)
FROM (
SELECT DISTINCT puf.persona_id, uf.id
FROM persona__user_folder puf
JOIN user_file uf ON uf.folder_id = puf.user_folder_id
WHERE NOT EXISTS (
SELECT 1
FROM persona__user_file p2
WHERE p2.persona_id = puf.persona_id
AND p2.user_file_id = uf.id
)
) AS distinct_pairs
"""
)
to_migrate = bind.execute(count_query).scalar_one()
if to_migrate > 0:
logger.info(f"Creating {to_migrate} persona-file relationships...")
# Migrate in batches to avoid memory issues
batch_size = 10000
total_inserted = 0
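# Each pass inserts at most batch_size new (persona_id, user_file_id) pairs; the NOT EXISTS
# guard excludes pairs created by earlier passes, so the loop terminates once an iteration
# inserts fewer than batch_size rows.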
while True:
# Insert batch directly using subquery (asyncpg compatible)
result = bind.execute(
text(
"""
INSERT INTO persona__user_file (persona_id, user_file_id, user_file_id_uuid)
SELECT DISTINCT puf.persona_id, uf.id as file_id, uf.new_id
FROM persona__user_folder puf
JOIN user_file uf ON uf.folder_id = puf.user_folder_id
WHERE NOT EXISTS (
SELECT 1
FROM persona__user_file p2
WHERE p2.persona_id = puf.persona_id
AND p2.user_file_id = uf.id
)
LIMIT :batch_size
"""
),
{"batch_size": batch_size},
)
inserted = result.rowcount
total_inserted += inserted
if inserted < batch_size:
break
logger.info(
f" Migrated {total_inserted}/{to_migrate} relationships..."
)
logger.info(
f"Created {total_inserted} persona__user_file relationships"
)
# === Step 2: Add foreign key for chat_session.project_id ===
chat_session_fks = inspector.get_foreign_keys("chat_session")
fk_exists = any(
fk["name"] == "fk_chat_session_project_id" for fk in chat_session_fks
)
if not fk_exists:
logger.info("Adding foreign key constraint for chat_session.project_id...")
op.create_foreign_key(
"fk_chat_session_project_id",
"chat_session",
"user_project",
["project_id"],
["id"],
)
logger.info("Added foreign key constraint")
# === Step 3: Populate project__user_file from user_file.folder_id ===
user_file_columns = [col["name"] for col in inspector.get_columns("user_file")]
has_new_id = "new_id" in user_file_columns
if has_new_id and "folder_id" in user_file_columns:
logger.info("Populating project__user_file from folder relationships...")
# Count relationships to create
count_query = text(
"""
SELECT COUNT(*)
FROM user_file uf
WHERE uf.folder_id IS NOT NULL
AND NOT EXISTS (
SELECT 1
FROM project__user_file puf
WHERE puf.project_id = uf.folder_id
AND puf.user_file_id = uf.new_id
)
"""
)
to_create = bind.execute(count_query).scalar_one()
if to_create > 0:
logger.info(f"Creating {to_create} project-file relationships...")
# Insert in batches
batch_size = 10000
total_inserted = 0
while True:
result = bind.execute(
text(
"""
INSERT INTO project__user_file (project_id, user_file_id)
SELECT uf.folder_id, uf.new_id
FROM user_file uf
WHERE uf.folder_id IS NOT NULL
AND NOT EXISTS (
SELECT 1
FROM project__user_file puf
WHERE puf.project_id = uf.folder_id
AND puf.user_file_id = uf.new_id
)
LIMIT :batch_size
ON CONFLICT (project_id, user_file_id) DO NOTHING
"""
),
{"batch_size": batch_size},
)
inserted = result.rowcount
total_inserted += inserted
if inserted < batch_size:
break
logger.info(f" Created {total_inserted}/{to_create} relationships...")
logger.info(f"Created {total_inserted} project__user_file relationships")
# === Step 4: Create index on chat_session.project_id ===
try:
indexes = [ix.get("name") for ix in inspector.get_indexes("chat_session")]
except Exception:
indexes = []
if "ix_chat_session_project_id" not in indexes:
logger.info("Creating index on chat_session.project_id...")
op.create_index(
"ix_chat_session_project_id", "chat_session", ["project_id"], unique=False
)
logger.info("Created index")
logger.info("Migration 3 (relationship migration) completed successfully")
def downgrade() -> None:
"""Remove migrated relationships and constraints."""
bind = op.get_bind()
inspector = sa.inspect(bind)
logger.info("Starting downgrade of relationship migration...")
# Drop index on chat_session.project_id
try:
indexes = [ix.get("name") for ix in inspector.get_indexes("chat_session")]
if "ix_chat_session_project_id" in indexes:
op.drop_index("ix_chat_session_project_id", "chat_session")
logger.info("Dropped index on chat_session.project_id")
except Exception:
pass
# Drop foreign key constraint
try:
chat_session_fks = inspector.get_foreign_keys("chat_session")
fk_exists = any(
fk["name"] == "fk_chat_session_project_id" for fk in chat_session_fks
)
if fk_exists:
op.drop_constraint(
"fk_chat_session_project_id", "chat_session", type_="foreignkey"
)
logger.info("Dropped foreign key constraint on chat_session.project_id")
except Exception:
pass
# Clear project__user_file relationships (but keep the table for migration 1 to handle)
if "project__user_file" in inspector.get_table_names():
result = bind.execute(text("DELETE FROM project__user_file"))
logger.info(f"Cleared {result.rowcount} records from project__user_file")
# Remove migrated persona__user_file relationships
# Only remove those that came from folder relationships
if all(
table in inspector.get_table_names()
for table in ["persona__user_file", "persona__user_folder", "user_file"]
):
user_file_columns = [col["name"] for col in inspector.get_columns("user_file")]
if "folder_id" in user_file_columns:
result = bind.execute(
text(
"""
DELETE FROM persona__user_file puf
WHERE EXISTS (
SELECT 1
FROM user_file uf
JOIN persona__user_folder puf2
ON puf2.user_folder_id = uf.folder_id
WHERE puf.persona_id = puf2.persona_id
AND puf.user_file_id = uf.id
)
"""
)
)
logger.info(
f"Removed {result.rowcount} migrated persona__user_file relationships"
)
logger.info("Downgrade completed successfully")

View File

@@ -0,0 +1,45 @@
"""Add foreign key to user__external_user_group_id
Revision ID: 238b84885828
Revises: a7688ab35c45
Create Date: 2025-05-19 17:15:33.424584
"""
from alembic import op
# revision identifiers, used by Alembic.
revision = "238b84885828"
down_revision = "a7688ab35c45"
branch_labels = None
depends_on = None
def upgrade() -> None:
# First, clean up any entries that don't have a valid cc_pair_id
op.execute(
"""
DELETE FROM user__external_user_group_id
WHERE cc_pair_id NOT IN (SELECT id FROM connector_credential_pair)
"""
)
# Add foreign key constraint with cascade delete
op.create_foreign_key(
"fk_user__external_user_group_id_cc_pair_id",
"user__external_user_group_id",
"connector_credential_pair",
["cc_pair_id"],
["id"],
ondelete="CASCADE",
)
def downgrade() -> None:
# Drop the foreign key constraint
op.drop_constraint(
"fk_user__external_user_group_id_cc_pair_id",
"user__external_user_group_id",
type_="foreignkey",
)

View File

@@ -144,27 +144,34 @@ def upgrade() -> None:
def downgrade() -> None:
op.execute("TRUNCATE TABLE index_attempt")
op.add_column(
"index_attempt",
sa.Column("input_type", sa.VARCHAR(), autoincrement=False, nullable=False),
)
op.add_column(
"index_attempt",
sa.Column("source", sa.VARCHAR(), autoincrement=False, nullable=False),
)
op.add_column(
"index_attempt",
sa.Column(
"connector_specific_config",
postgresql.JSONB(astext_type=sa.Text()),
autoincrement=False,
nullable=False,
),
)
# Check if the constraint exists before dropping
conn = op.get_bind()
inspector = sa.inspect(conn)
existing_columns = {col["name"] for col in inspector.get_columns("index_attempt")}
if "input_type" not in existing_columns:
op.add_column(
"index_attempt",
sa.Column("input_type", sa.VARCHAR(), autoincrement=False, nullable=False),
)
if "source" not in existing_columns:
op.add_column(
"index_attempt",
sa.Column("source", sa.VARCHAR(), autoincrement=False, nullable=False),
)
if "connector_specific_config" not in existing_columns:
op.add_column(
"index_attempt",
sa.Column(
"connector_specific_config",
postgresql.JSONB(astext_type=sa.Text()),
autoincrement=False,
nullable=False,
),
)
# Check if the constraint exists before dropping
constraints = inspector.get_foreign_keys("index_attempt")
if any(
@@ -183,8 +190,12 @@ def downgrade() -> None:
"fk_index_attempt_connector_id", "index_attempt", type_="foreignkey"
)
op.drop_column("index_attempt", "credential_id")
op.drop_column("index_attempt", "connector_id")
op.drop_table("connector_credential_pair")
op.drop_table("credential")
op.drop_table("connector")
if "credential_id" in existing_columns:
op.drop_column("index_attempt", "credential_id")
if "connector_id" in existing_columns:
op.drop_column("index_attempt", "connector_id")
op.execute("DROP TABLE IF EXISTS connector_credential_pair CASCADE")
op.execute("DROP TABLE IF EXISTS credential CASCADE")
op.execute("DROP TABLE IF EXISTS connector CASCADE")

View File

@@ -0,0 +1,218 @@
"""Migration 6: User file schema cleanup
Revision ID: 2b75d0a8ffcb
Revises: 3a78dba1080a
Create Date: 2025-09-22 10:09:26.375377
This migration removes legacy columns and tables after data migration is complete.
It should only be run after verifying all data has been successfully migrated.
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy import text
import logging
logger = logging.getLogger("alembic.runtime.migration")
# revision identifiers, used by Alembic.
revision = "2b75d0a8ffcb"
down_revision = "3a78dba1080a"
branch_labels = None
depends_on = None
def upgrade() -> None:
"""Remove legacy columns and tables."""
bind = op.get_bind()
inspector = sa.inspect(bind)
logger.info("Starting schema cleanup...")
# === Step 1: Verify data migration is complete ===
logger.info("Verifying data migration completion...")
# Check if any chat sessions still have folder_id references
chat_session_columns = [
col["name"] for col in inspector.get_columns("chat_session")
]
if "folder_id" in chat_session_columns:
orphaned_count = bind.execute(
text(
"""
SELECT COUNT(*) FROM chat_session
WHERE folder_id IS NOT NULL AND project_id IS NULL
"""
)
).scalar_one()
if orphaned_count > 0:
logger.warning(
f"WARNING: {orphaned_count} chat_session records still have "
f"folder_id without project_id. Proceeding anyway."
)
# === Step 2: Drop chat_session.folder_id ===
if "folder_id" in chat_session_columns:
logger.info("Dropping chat_session.folder_id...")
# Drop foreign key constraint first
op.execute(
"ALTER TABLE chat_session DROP CONSTRAINT IF EXISTS chat_session_folder_fk"
)
# Drop the column
op.drop_column("chat_session", "folder_id")
logger.info("Dropped chat_session.folder_id")
# === Step 3: Drop persona__user_folder table ===
if "persona__user_folder" in inspector.get_table_names():
logger.info("Dropping persona__user_folder table...")
# Check for any remaining data
remaining = bind.execute(
text("SELECT COUNT(*) FROM persona__user_folder")
).scalar_one()
if remaining > 0:
logger.warning(
f"WARNING: Dropping persona__user_folder with {remaining} records"
)
op.drop_table("persona__user_folder")
logger.info("Dropped persona__user_folder table")
# === Step 4: Drop chat_folder table ===
if "chat_folder" in inspector.get_table_names():
logger.info("Dropping chat_folder table...")
# Check for any remaining data
remaining = bind.execute(text("SELECT COUNT(*) FROM chat_folder")).scalar_one()
if remaining > 0:
logger.warning(f"WARNING: Dropping chat_folder with {remaining} records")
op.drop_table("chat_folder")
logger.info("Dropped chat_folder table")
# === Step 5: Drop user_file legacy columns ===
user_file_columns = [col["name"] for col in inspector.get_columns("user_file")]
# Drop folder_id
if "folder_id" in user_file_columns:
logger.info("Dropping user_file.folder_id...")
op.drop_column("user_file", "folder_id")
logger.info("Dropped user_file.folder_id")
# Drop cc_pair_id (already handled in migration 5, but be sure)
if "cc_pair_id" in user_file_columns:
logger.info("Dropping user_file.cc_pair_id...")
# Drop any remaining foreign key constraints
bind.execute(
text(
"""
DO $$
DECLARE r RECORD;
BEGIN
FOR r IN (
SELECT conname
FROM pg_constraint c
JOIN pg_class t ON c.conrelid = t.oid
WHERE c.contype = 'f'
AND t.relname = 'user_file'
AND EXISTS (
SELECT 1 FROM pg_attribute a
WHERE a.attrelid = t.oid
AND a.attname = 'cc_pair_id'
)
) LOOP
EXECUTE format('ALTER TABLE user_file DROP CONSTRAINT IF EXISTS %I', r.conname);
END LOOP;
END$$;
"""
)
)
op.drop_column("user_file", "cc_pair_id")
logger.info("Dropped user_file.cc_pair_id")
# === Step 6: Clean up any remaining constraints ===
logger.info("Cleaning up remaining constraints...")
# Drop any unique constraints on removed columns
op.execute(
"ALTER TABLE user_file DROP CONSTRAINT IF EXISTS user_file_cc_pair_id_key"
)
logger.info("Migration 6 (schema cleanup) completed successfully")
logger.info("Legacy schema has been fully removed")
def downgrade() -> None:
"""Recreate dropped columns and tables (structure only, no data)."""
bind = op.get_bind()
inspector = sa.inspect(bind)
logger.warning("Downgrading schema cleanup - recreating structure only, no data!")
# Recreate user_file columns
if "user_file" in inspector.get_table_names():
columns = [col["name"] for col in inspector.get_columns("user_file")]
if "cc_pair_id" not in columns:
op.add_column(
"user_file", sa.Column("cc_pair_id", sa.Integer(), nullable=True)
)
if "folder_id" not in columns:
op.add_column(
"user_file", sa.Column("folder_id", sa.Integer(), nullable=True)
)
# Recreate chat_folder table
if "chat_folder" not in inspector.get_table_names():
op.create_table(
"chat_folder",
sa.Column("id", sa.Integer(), nullable=False),
sa.Column("user_id", sa.UUID(), nullable=False),
sa.Column("name", sa.String(), nullable=False),
sa.Column("created_at", sa.DateTime(timezone=True), nullable=False),
sa.PrimaryKeyConstraint("id"),
sa.ForeignKeyConstraint(
["user_id"], ["user.id"], name="chat_folder_user_fk"
),
)
# Recreate persona__user_folder table
if "persona__user_folder" not in inspector.get_table_names():
op.create_table(
"persona__user_folder",
sa.Column("persona_id", sa.Integer(), nullable=False),
sa.Column("user_folder_id", sa.Integer(), nullable=False),
sa.PrimaryKeyConstraint("persona_id", "user_folder_id"),
sa.ForeignKeyConstraint(["persona_id"], ["persona.id"]),
sa.ForeignKeyConstraint(["user_folder_id"], ["user_project.id"]),
)
# Add folder_id back to chat_session
if "chat_session" in inspector.get_table_names():
columns = [col["name"] for col in inspector.get_columns("chat_session")]
if "folder_id" not in columns:
op.add_column(
"chat_session", sa.Column("folder_id", sa.Integer(), nullable=True)
)
# Add foreign key if chat_folder exists
if "chat_folder" in inspector.get_table_names():
op.create_foreign_key(
"chat_session_folder_fk",
"chat_session",
"chat_folder",
["folder_id"],
["id"],
)
logger.info("Downgrade completed - structure recreated but data is lost")

View File

@@ -0,0 +1,115 @@
"""add_indexing_coordination
Revision ID: 2f95e36923e6
Revises: 0816326d83aa
Create Date: 2025-07-10 16:17:57.762182
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "2f95e36923e6"
down_revision = "0816326d83aa"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Add database-based coordination fields (replacing Redis fencing)
op.add_column(
"index_attempt", sa.Column("celery_task_id", sa.String(), nullable=True)
)
op.add_column(
"index_attempt",
sa.Column(
"cancellation_requested",
sa.Boolean(),
nullable=False,
server_default="false",
),
)
# Add batch coordination fields (replacing FileStore state)
op.add_column(
"index_attempt", sa.Column("total_batches", sa.Integer(), nullable=True)
)
op.add_column(
"index_attempt",
sa.Column(
"completed_batches", sa.Integer(), nullable=False, server_default="0"
),
)
op.add_column(
"index_attempt",
sa.Column(
"total_failures_batch_level",
sa.Integer(),
nullable=False,
server_default="0",
),
)
op.add_column(
"index_attempt",
sa.Column("total_chunks", sa.Integer(), nullable=False, server_default="0"),
)
# Progress tracking for stall detection
op.add_column(
"index_attempt",
sa.Column("last_progress_time", sa.DateTime(timezone=True), nullable=True),
)
op.add_column(
"index_attempt",
sa.Column(
"last_batches_completed_count",
sa.Integer(),
nullable=False,
server_default="0",
),
)
# Heartbeat tracking for worker liveness detection
op.add_column(
"index_attempt",
sa.Column(
"heartbeat_counter", sa.Integer(), nullable=False, server_default="0"
),
)
op.add_column(
"index_attempt",
sa.Column(
"last_heartbeat_value", sa.Integer(), nullable=False, server_default="0"
),
)
op.add_column(
"index_attempt",
sa.Column("last_heartbeat_time", sa.DateTime(timezone=True), nullable=True),
)
# Add index for coordination queries
op.create_index(
"ix_index_attempt_active_coordination",
"index_attempt",
["connector_credential_pair_id", "search_settings_id", "status"],
)
def downgrade() -> None:
# Remove the new index
op.drop_index("ix_index_attempt_active_coordination", table_name="index_attempt")
# Remove the new columns
op.drop_column("index_attempt", "last_batches_completed_count")
op.drop_column("index_attempt", "last_progress_time")
op.drop_column("index_attempt", "last_heartbeat_time")
op.drop_column("index_attempt", "last_heartbeat_value")
op.drop_column("index_attempt", "heartbeat_counter")
op.drop_column("index_attempt", "total_chunks")
op.drop_column("index_attempt", "total_failures_batch_level")
op.drop_column("index_attempt", "completed_batches")
op.drop_column("index_attempt", "total_batches")
op.drop_column("index_attempt", "cancellation_requested")
op.drop_column("index_attempt", "celery_task_id")

View File

@@ -0,0 +1,136 @@
"""update_kg_trigger_functions
Revision ID: 36e9220ab794
Revises: c9e2cd766c29
Create Date: 2025-06-22 17:33:25.833733
"""
from alembic import op
from sqlalchemy.orm import Session
from sqlalchemy import text
from shared_configs.configs import POSTGRES_DEFAULT_SCHEMA
# revision identifiers, used by Alembic.
revision = "36e9220ab794"
down_revision = "c9e2cd766c29"
branch_labels = None
depends_on = None
def _get_tenant_contextvar(session: Session) -> str:
"""Get the current schema for the migration"""
current_tenant = session.execute(text("SELECT current_schema()")).scalar()
if isinstance(current_tenant, str):
return current_tenant
else:
raise ValueError("Current tenant is not a string")
def upgrade() -> None:
bind = op.get_bind()
session = Session(bind=bind)
# Create kg_entity trigger to update kg_entity.name and its trigrams
tenant_id = _get_tenant_contextvar(session)
alphanum_pattern = r"[^a-z0-9]+"
truncate_length = 1000
function = "update_kg_entity_name"
op.execute(
text(
f"""
CREATE OR REPLACE FUNCTION "{tenant_id}".{function}()
RETURNS TRIGGER AS $$
DECLARE
name text;
cleaned_name text;
BEGIN
-- Set name to semantic_id if document_id is not NULL
IF NEW.document_id IS NOT NULL THEN
SELECT lower(semantic_id) INTO name
FROM "{tenant_id}".document
WHERE id = NEW.document_id;
ELSE
name = lower(NEW.name);
END IF;
-- Clean name and truncate if too long
cleaned_name = regexp_replace(
name,
'{alphanum_pattern}', '', 'g'
);
IF length(cleaned_name) > {truncate_length} THEN
cleaned_name = left(cleaned_name, {truncate_length});
END IF;
-- Set name and name trigrams
NEW.name = name;
NEW.name_trigrams = {POSTGRES_DEFAULT_SCHEMA}.show_trgm(cleaned_name);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
"""
)
)
trigger = f"{function}_trigger"
op.execute(f'DROP TRIGGER IF EXISTS {trigger} ON "{tenant_id}".kg_entity')
op.execute(
f"""
CREATE TRIGGER {trigger}
BEFORE INSERT OR UPDATE OF name
ON "{tenant_id}".kg_entity
FOR EACH ROW
EXECUTE FUNCTION "{tenant_id}".{function}();
"""
)
# Create document trigger to propagate semantic_id changes into kg_entity.name and its trigrams
function = "update_kg_entity_name_from_doc"
op.execute(
text(
f"""
CREATE OR REPLACE FUNCTION "{tenant_id}".{function}()
RETURNS TRIGGER AS $$
DECLARE
doc_name text;
cleaned_name text;
BEGIN
doc_name = lower(NEW.semantic_id);
-- Clean name and truncate if too long
cleaned_name = regexp_replace(
doc_name,
'{alphanum_pattern}', '', 'g'
);
IF length(cleaned_name) > {truncate_length} THEN
cleaned_name = left(cleaned_name, {truncate_length});
END IF;
-- Set name and name trigrams for all entities referencing this document
UPDATE "{tenant_id}".kg_entity
SET
name = doc_name,
name_trigrams = {POSTGRES_DEFAULT_SCHEMA}.show_trgm(cleaned_name)
WHERE document_id = NEW.id;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
"""
)
)
trigger = f"{function}_trigger"
op.execute(f'DROP TRIGGER IF EXISTS {trigger} ON "{tenant_id}".document')
op.execute(
f"""
CREATE TRIGGER {trigger}
AFTER UPDATE OF semantic_id
ON "{tenant_id}".document
FOR EACH ROW
EXECUTE FUNCTION "{tenant_id}".{function}();
"""
)
def downgrade() -> None:
pass
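These triggers keep kg_entity.name_trigrams in sync with the cleaned, lowercased entity name so that fuzzy lookups can lean on the trigram GIN indexes instead of recomputing trigrams per row. A rough sketch of such a lookup, assuming pg_trgm's % similarity operator and a SQLAlchemy engine (the function and its wiring are illustrative, not part of this migration):

from sqlalchemy import create_engine, text

def fuzzy_entity_lookup(engine, query_name: str, schema: str) -> list[str]:
    """Find entities whose name is trigram-similar to query_name."""
    # Mirror the trigger's cleaning: lowercase and strip non-alphanumerics.
    cleaned = "".join(ch for ch in query_name.lower() if ch.isalnum())
    with engine.connect() as conn:
        rows = conn.execute(
            text(
                f'SELECT id_name FROM "{schema}".kg_entity '
                "WHERE name % :q "  # pg_trgm similarity operator
                "ORDER BY similarity(name, :q) DESC LIMIT 10"
            ),
            {"q": cleaned},
        )
        return [row[0] for row in rows]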

View File

@@ -0,0 +1,298 @@
"""Migration 5: User file legacy data cleanup
Revision ID: 3a78dba1080a
Revises: 7cc3fcc116c1
Create Date: 2025-09-22 10:04:27.986294
This migration removes legacy user-file documents and connector_credential_pairs.
It performs bulk deletions of obsolete data after the UUID migration.
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql as psql
from sqlalchemy import text
import logging
from typing import List
import uuid
logger = logging.getLogger("alembic.runtime.migration")
# revision identifiers, used by Alembic.
revision = "3a78dba1080a"
down_revision = "7cc3fcc116c1"
branch_labels = None
depends_on = None
def batch_delete(
bind: sa.engine.Connection,
table_name: str,
id_column: str,
ids: List[str | int | uuid.UUID],
batch_size: int = 1000,
id_type: str = "int",
) -> int:
"""Delete records in batches to avoid memory issues and timeouts."""
total_count = len(ids)
if total_count == 0:
return 0
logger.info(
f"Starting batch deletion of {total_count} records from {table_name}..."
)
# Determine appropriate ARRAY type
if id_type == "uuid":
array_type = psql.ARRAY(psql.UUID(as_uuid=True))
elif id_type == "int":
array_type = psql.ARRAY(sa.Integer())
else:
array_type = psql.ARRAY(sa.String())
total_deleted = 0
failed_batches = []
for i in range(0, total_count, batch_size):
batch_ids = ids[i : i + batch_size]
try:
stmt = text(
f"DELETE FROM {table_name} WHERE {id_column} = ANY(:ids)"
).bindparams(sa.bindparam("ids", value=batch_ids, type_=array_type))
result = bind.execute(stmt)
total_deleted += result.rowcount
# Log progress every 10 batches or at completion
batch_num = (i // batch_size) + 1
if batch_num % 10 == 0 or i + batch_size >= total_count:
logger.info(
f" Deleted {min(i + batch_size, total_count)}/{total_count} records "
f"({total_deleted} actual) from {table_name}"
)
except Exception as e:
logger.error(f"Failed to delete batch {(i // batch_size) + 1}: {e}")
failed_batches.append((i, min(i + batch_size, total_count)))
if failed_batches:
logger.warning(
f"Failed to delete {len(failed_batches)} batches from {table_name}. "
f"Total deleted: {total_deleted}/{total_count}"
)
# Fail the migration to avoid silently succeeding on partial cleanup
raise RuntimeError(
f"Batch deletion failed for {table_name}: "
f"{len(failed_batches)} failed batches out of "
f"{(total_count + batch_size - 1) // batch_size}."
)
return total_deleted
def upgrade() -> None:
"""Remove legacy user-file documents and connector_credential_pairs."""
bind = op.get_bind()
inspector = sa.inspect(bind)
logger.info("Starting legacy data cleanup...")
# === Step 1: Identify and delete user-file documents ===
logger.info("Identifying user-file documents to delete...")
# Get document IDs to delete
doc_rows = bind.execute(
text(
"""
SELECT DISTINCT dcc.id AS document_id
FROM document_by_connector_credential_pair dcc
JOIN connector_credential_pair u
ON u.connector_id = dcc.connector_id
AND u.credential_id = dcc.credential_id
WHERE u.is_user_file IS TRUE
"""
)
).fetchall()
doc_ids = [r[0] for r in doc_rows]
if doc_ids:
logger.info(f"Found {len(doc_ids)} user-file documents to delete")
# Delete dependent rows first
tables_to_clean = [
("document_retrieval_feedback", "document_id"),
("document__tag", "document_id"),
("chunk_stats", "document_id"),
]
for table_name, column_name in tables_to_clean:
if table_name in inspector.get_table_names():
# document_id is a string in these tables
deleted = batch_delete(
bind, table_name, column_name, doc_ids, id_type="str"
)
logger.info(f"Deleted {deleted} records from {table_name}")
# Delete document_by_connector_credential_pair entries
deleted = batch_delete(
bind, "document_by_connector_credential_pair", "id", doc_ids, id_type="str"
)
logger.info(f"Deleted {deleted} document_by_connector_credential_pair records")
# Delete documents themselves
deleted = batch_delete(bind, "document", "id", doc_ids, id_type="str")
logger.info(f"Deleted {deleted} document records")
else:
logger.info("No user-file documents found to delete")
# === Step 2: Clean up user-file connector_credential_pairs ===
logger.info("Cleaning up user-file connector_credential_pairs...")
# Get cc_pair IDs
cc_pair_rows = bind.execute(
text(
"""
SELECT id AS cc_pair_id
FROM connector_credential_pair
WHERE is_user_file IS TRUE
"""
)
).fetchall()
cc_pair_ids = [r[0] for r in cc_pair_rows]
if cc_pair_ids:
logger.info(
f"Found {len(cc_pair_ids)} user-file connector_credential_pairs to clean up"
)
# Delete related records
# Clean child tables first to satisfy foreign key constraints,
# then the parent tables
tables_to_clean = [
("index_attempt_errors", "connector_credential_pair_id"),
("index_attempt", "connector_credential_pair_id"),
("background_error", "cc_pair_id"),
("document_set__connector_credential_pair", "connector_credential_pair_id"),
("user_group__connector_credential_pair", "cc_pair_id"),
]
for table_name, column_name in tables_to_clean:
if table_name in inspector.get_table_names():
deleted = batch_delete(
bind, table_name, column_name, cc_pair_ids, id_type="int"
)
logger.info(f"Deleted {deleted} records from {table_name}")
# === Step 3: Identify connectors and credentials to delete ===
logger.info("Identifying orphaned connectors and credentials...")
# Get connectors used only by user-file cc_pairs
connector_rows = bind.execute(
text(
"""
SELECT DISTINCT ccp.connector_id
FROM connector_credential_pair ccp
WHERE ccp.is_user_file IS TRUE
AND ccp.connector_id != 0 -- Exclude system default
AND NOT EXISTS (
SELECT 1
FROM connector_credential_pair c2
WHERE c2.connector_id = ccp.connector_id
AND c2.is_user_file IS NOT TRUE
)
"""
)
).fetchall()
userfile_only_connector_ids = [r[0] for r in connector_rows]
# Get credentials used only by user-file cc_pairs
credential_rows = bind.execute(
text(
"""
SELECT DISTINCT ccp.credential_id
FROM connector_credential_pair ccp
WHERE ccp.is_user_file IS TRUE
AND ccp.credential_id != 0 -- Exclude public/default
AND NOT EXISTS (
SELECT 1
FROM connector_credential_pair c2
WHERE c2.credential_id = ccp.credential_id
AND c2.is_user_file IS NOT TRUE
)
"""
)
).fetchall()
userfile_only_credential_ids = [r[0] for r in credential_rows]
# === Step 4: Delete the cc_pairs themselves ===
if cc_pair_ids:
# Remove FK dependency from user_file first
bind.execute(
text(
"""
DO $$
DECLARE r RECORD;
BEGIN
FOR r IN (
SELECT conname
FROM pg_constraint c
JOIN pg_class t ON c.conrelid = t.oid
JOIN pg_class ft ON c.confrelid = ft.oid
WHERE c.contype = 'f'
AND t.relname = 'user_file'
AND ft.relname = 'connector_credential_pair'
) LOOP
EXECUTE format('ALTER TABLE user_file DROP CONSTRAINT IF EXISTS %I', r.conname);
END LOOP;
END$$;
"""
)
)
# Delete cc_pairs
deleted = batch_delete(
bind, "connector_credential_pair", "id", cc_pair_ids, id_type="int"
)
logger.info(f"Deleted {deleted} connector_credential_pair records")
# === Step 5: Delete orphaned connectors ===
if userfile_only_connector_ids:
deleted = batch_delete(
bind, "connector", "id", userfile_only_connector_ids, id_type="int"
)
logger.info(f"Deleted {deleted} orphaned connector records")
# === Step 6: Delete orphaned credentials ===
if userfile_only_credential_ids:
# Clean up credential__user_group mappings first
deleted = batch_delete(
bind,
"credential__user_group",
"credential_id",
userfile_only_credential_ids,
id_type="int",
)
logger.info(f"Deleted {deleted} credential__user_group records")
# Delete credentials
deleted = batch_delete(
bind, "credential", "id", userfile_only_credential_ids, id_type="int"
)
logger.info(f"Deleted {deleted} orphaned credential records")
logger.info("Migration 5 (legacy data cleanup) completed successfully")
def downgrade() -> None:
"""Cannot restore deleted data - requires backup restoration."""
logger.error("CRITICAL: Downgrading data cleanup cannot restore deleted data!")
logger.error("Data restoration requires backup files or database backup.")
raise NotImplementedError(
"Downgrade of legacy data cleanup is not supported. "
"Deleted data must be restored from backups."
)
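Because each batch is a single DELETE ... WHERE id = ANY(:ids), the total number of statements is just a ceiling division, the same expression the RuntimeError above reports. A tiny self-contained check of that arithmetic:

def batches_needed(total_count: int, batch_size: int = 1000) -> int:
    """Number of DELETE statements batch_delete issues (ceiling division)."""
    return (total_count + batch_size - 1) // batch_size

assert batches_needed(2500) == 3   # 1000 + 1000 + 500
assert batches_needed(1000) == 1   # exact multiple needs no extra batch
assert batches_needed(0) == 0      # batch_delete returns early with 0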

View File

@@ -21,22 +21,14 @@ depends_on = None
  # an outage by creating an index without using CONCURRENTLY. This migration:
  #
  # 1. Creates more efficient full-text search capabilities using tsvector columns and GIN indexes
- # 2. Uses CONCURRENTLY for all index creation to prevent table locking
- # 3. Explicitly manages transactions with COMMIT statements to allow CONCURRENTLY to work
- #    (see: https://www.postgresql.org/docs/9.4/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY)
- #    (see: https://github.com/sqlalchemy/alembic/issues/277)
- # 4. Adds indexes to both chat_message and chat_session tables for comprehensive search
+ # 2. Adds indexes to both chat_message and chat_session tables for comprehensive search
+ # 3. Note: CONCURRENTLY was removed due to operational issues
  def upgrade() -> None:
      # First, drop any existing indexes to avoid conflicts
-     op.execute("COMMIT")
-     op.execute("DROP INDEX CONCURRENTLY IF EXISTS idx_chat_message_tsv;")
-     op.execute("COMMIT")
-     op.execute("DROP INDEX CONCURRENTLY IF EXISTS idx_chat_session_desc_tsv;")
-     op.execute("COMMIT")
+     op.execute("DROP INDEX IF EXISTS idx_chat_message_tsv;")
+     op.execute("DROP INDEX IF EXISTS idx_chat_session_desc_tsv;")
      op.execute("DROP INDEX IF EXISTS idx_chat_message_message_lower;")
      # Drop existing columns if they exist
@@ -52,12 +44,9 @@ def upgrade() -> None:
          """
      )
-     # Commit the current transaction before creating concurrent indexes
-     op.execute("COMMIT")
      op.execute(
          """
-         CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_chat_message_tsv
+         CREATE INDEX IF NOT EXISTS idx_chat_message_tsv
          ON chat_message
          USING GIN (message_tsv)
          """
@@ -72,12 +61,9 @@ def upgrade() -> None:
          """
      )
-     # Commit again before creating the second concurrent index
-     op.execute("COMMIT")
      op.execute(
          """
-         CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_chat_session_desc_tsv
+         CREATE INDEX IF NOT EXISTS idx_chat_session_desc_tsv
          ON chat_session
          USING GIN (description_tsv)
          """
@@ -85,12 +71,9 @@ def upgrade() -> None:
  def downgrade() -> None:
-     # Drop the indexes first (use CONCURRENTLY for dropping too)
-     op.execute("COMMIT")
-     op.execute("DROP INDEX CONCURRENTLY IF EXISTS idx_chat_message_tsv;")
-     op.execute("COMMIT")
-     op.execute("DROP INDEX CONCURRENTLY IF EXISTS idx_chat_session_desc_tsv;")
+     # Drop the indexes first
+     op.execute("DROP INDEX IF EXISTS idx_chat_message_tsv;")
+     op.execute("DROP INDEX IF EXISTS idx_chat_session_desc_tsv;")
      # Then drop the columns
      op.execute("ALTER TABLE chat_message DROP COLUMN IF EXISTS message_tsv;")

View File

@@ -0,0 +1,30 @@
"""add_doc_metadata_field_in_document_model
Revision ID: 3fc5d75723b3
Revises: 2f95e36923e6
Create Date: 2025-07-28 18:45:37.985406
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
# revision identifiers, used by Alembic.
revision = "3fc5d75723b3"
down_revision = "2f95e36923e6"
branch_labels = None
depends_on = None
def upgrade() -> None:
op.add_column(
"document",
sa.Column(
"doc_metadata", postgresql.JSONB(astext_type=sa.Text()), nullable=True
),
)
def downgrade() -> None:
op.drop_column("document", "doc_metadata")

View File

@@ -0,0 +1,28 @@
"""reset userfile document_id_migrated field
Revision ID: 40926a4dab77
Revises: 64bd5677aeb6
Create Date: 2025-10-06 16:10:32.898668
"""
from alembic import op
# revision identifiers, used by Alembic.
revision = "40926a4dab77"
down_revision = "64bd5677aeb6"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Set all existing records to not migrated
op.execute(
"UPDATE user_file SET document_id_migrated = FALSE "
"WHERE document_id_migrated IS DISTINCT FROM FALSE;"
)
def downgrade() -> None:
# No-op
pass
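The IS DISTINCT FROM FALSE predicate is doing real work here: a plain != FALSE evaluates to NULL for NULL rows and would skip them, while IS DISTINCT FROM is NULL-safe and resets those rows too. The three-valued behavior, mirrored in Python:

def is_distinct_from_false(value: bool | None) -> bool:
    """Mirror SQL's IS DISTINCT FROM FALSE, where NULL counts as distinct."""
    return value is not False

assert is_distinct_from_false(True) is True    # TRUE rows are reset
assert is_distinct_from_false(None) is True    # NULL rows are reset as well
assert is_distinct_from_false(False) is False  # already FALSE, left untouched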

View File

@@ -0,0 +1,150 @@
"""Fix invalid model-configurations state
Revision ID: 47a07e1a38f1
Revises: 7a70b7664e37
Create Date: 2025-04-23 15:39:43.159504
"""
from alembic import op
from pydantic import BaseModel, ConfigDict
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
from onyx.llm.llm_provider_options import (
fetch_model_names_for_provider_as_set,
fetch_visible_model_names_for_provider_as_set,
)
# revision identifiers, used by Alembic.
revision = "47a07e1a38f1"
down_revision = "7a70b7664e37"
branch_labels = None
depends_on = None
class _SimpleModelConfiguration(BaseModel):
# Configure model to read from attributes
model_config = ConfigDict(from_attributes=True)
id: int
llm_provider_id: int
name: str
is_visible: bool
max_input_tokens: int | None
def upgrade() -> None:
llm_provider_table = sa.sql.table(
"llm_provider",
sa.column("id", sa.Integer),
sa.column("provider", sa.String),
sa.column("model_names", postgresql.ARRAY(sa.String)),
sa.column("display_model_names", postgresql.ARRAY(sa.String)),
sa.column("default_model_name", sa.String),
sa.column("fast_default_model_name", sa.String),
)
model_configuration_table = sa.sql.table(
"model_configuration",
sa.column("id", sa.Integer),
sa.column("llm_provider_id", sa.Integer),
sa.column("name", sa.String),
sa.column("is_visible", sa.Boolean),
sa.column("max_input_tokens", sa.Integer),
)
connection = op.get_bind()
llm_providers = connection.execute(
sa.select(
llm_provider_table.c.id,
llm_provider_table.c.provider,
)
).fetchall()
for llm_provider in llm_providers:
llm_provider_id, provider_name = llm_provider
default_models = fetch_model_names_for_provider_as_set(provider_name)
display_models = fetch_visible_model_names_for_provider_as_set(
provider_name=provider_name
)
# If `fetch_model_names_for_provider_as_set` returns `None` (or an empty set),
# then `provider_name` is not a well-known LLM provider, so skip it.
if not default_models:
continue
if not display_models:
raise RuntimeError(
"If `default_models` is non-None, `display_models` must be non-None too."
)
model_configurations = [
_SimpleModelConfiguration.model_validate(model_configuration)
for model_configuration in connection.execute(
sa.select(
model_configuration_table.c.id,
model_configuration_table.c.llm_provider_id,
model_configuration_table.c.name,
model_configuration_table.c.is_visible,
model_configuration_table.c.max_input_tokens,
).where(model_configuration_table.c.llm_provider_id == llm_provider_id)
).fetchall()
]
if model_configurations:
at_least_one_is_visible = any(
[
model_configuration.is_visible
for model_configuration in model_configurations
]
)
# If at least one model is already visible, the provider is in a valid state.
# Leave it untouched and move on to the next one.
if at_least_one_is_visible:
continue
existing_visible_model_names: set[str] = set(
[
model_configuration.name
for model_configuration in model_configurations
if model_configuration.is_visible
]
)
difference = display_models.difference(existing_visible_model_names)
for model_name in difference:
if not model_name:
continue
insert_statement = postgresql.insert(model_configuration_table).values(
llm_provider_id=llm_provider_id,
name=model_name,
is_visible=True,
max_input_tokens=None,
)
connection.execute(
insert_statement.on_conflict_do_update(
index_elements=["llm_provider_id", "name"],
set_={"is_visible": insert_statement.excluded.is_visible},
)
)
else:
for model_name in default_models:
connection.execute(
model_configuration_table.insert().values(
llm_provider_id=llm_provider_id,
name=model_name,
is_visible=model_name in display_models,
max_input_tokens=None,
)
)
def downgrade() -> None:
pass
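The repair path relies on PostgreSQL's upsert so that re-inserting an existing (llm_provider_id, name) pair flips its visibility instead of raising. A condensed sketch of that same pattern, with hypothetical values and assuming a unique constraint on the two columns:

import sqlalchemy as sa
from sqlalchemy.dialects import postgresql

model_configuration = sa.table(
    "model_configuration",
    sa.column("llm_provider_id", sa.Integer),
    sa.column("name", sa.String),
    sa.column("is_visible", sa.Boolean),
)

stmt = postgresql.insert(model_configuration).values(
    llm_provider_id=1, name="example-model", is_visible=True  # hypothetical values
)
upsert = stmt.on_conflict_do_update(
    index_elements=["llm_provider_id", "name"],  # the conflict target
    set_={"is_visible": stmt.excluded.is_visible},  # update instead of erroring
)
# connection.execute(upsert)  # run against a live bind, as in upgrade() above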

View File

@@ -0,0 +1,691 @@
"""create knowledge graph tables
Revision ID: 495cb26ce93e
Revises: ca04500b9ee8
Create Date: 2025-03-19 08:51:14.341989
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
from sqlalchemy import text
from datetime import datetime, timedelta
from onyx.configs.app_configs import DB_READONLY_USER
from onyx.configs.app_configs import DB_READONLY_PASSWORD
from shared_configs.configs import MULTI_TENANT
from shared_configs.configs import POSTGRES_DEFAULT_SCHEMA
# revision identifiers, used by Alembic.
revision = "495cb26ce93e"
down_revision = "ca04500b9ee8"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Create a new permission-less user to be later used for knowledge graph queries.
# The user will later get temporary read privileges for a specific view that will be
# ad hoc generated specific to a knowledge graph query.
#
# Note: in order for the migration to run, the DB_READONLY_USER and DB_READONLY_PASSWORD
# environment variables MUST be set. Otherwise, an exception will be raised.
if not MULTI_TENANT:
# Enable pg_trgm extension if not already enabled
op.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
# Create read-only db user here only in single tenant mode. For multi-tenant mode,
# the user is created in the alembic_tenants migration.
if not (DB_READONLY_USER and DB_READONLY_PASSWORD):
raise Exception("DB_READONLY_USER or DB_READONLY_PASSWORD is not set")
op.execute(
text(
f"""
DO $$
BEGIN
-- Check if the read-only user already exists
IF NOT EXISTS (SELECT FROM pg_catalog.pg_roles WHERE rolname = '{DB_READONLY_USER}') THEN
-- Create the read-only user with the specified password
EXECUTE format('CREATE USER %I WITH PASSWORD %L', '{DB_READONLY_USER}', '{DB_READONLY_PASSWORD}');
-- First revoke all privileges to ensure a clean slate
EXECUTE format('REVOKE ALL ON DATABASE %I FROM %I', current_database(), '{DB_READONLY_USER}');
-- Grant only the CONNECT privilege to allow the user to connect to the database
-- but not perform any operations without additional specific grants
EXECUTE format('GRANT CONNECT ON DATABASE %I TO %I', current_database(), '{DB_READONLY_USER}');
END IF;
END
$$;
"""
)
)
# Grant usage on current schema to readonly user
op.execute(
text(
f"""
DO $$
BEGIN
IF EXISTS (SELECT FROM pg_catalog.pg_roles WHERE rolname = '{DB_READONLY_USER}') THEN
EXECUTE format('GRANT USAGE ON SCHEMA %I TO %I', current_schema(), '{DB_READONLY_USER}');
END IF;
END
$$;
"""
)
)
op.execute("DROP TABLE IF EXISTS kg_config CASCADE")
op.create_table(
"kg_config",
sa.Column("id", sa.Integer(), primary_key=True, nullable=False, index=True),
sa.Column("kg_variable_name", sa.String(), nullable=False, index=True),
sa.Column("kg_variable_values", postgresql.ARRAY(sa.String()), nullable=False),
sa.UniqueConstraint("kg_variable_name", name="uq_kg_config_variable_name"),
)
# Insert initial data into kg_config table
op.bulk_insert(
sa.table(
"kg_config",
sa.column("kg_variable_name", sa.String),
sa.column("kg_variable_values", postgresql.ARRAY(sa.String)),
),
[
{"kg_variable_name": "KG_EXPOSED", "kg_variable_values": ["false"]},
{"kg_variable_name": "KG_ENABLED", "kg_variable_values": ["false"]},
{"kg_variable_name": "KG_VENDOR", "kg_variable_values": []},
{"kg_variable_name": "KG_VENDOR_DOMAINS", "kg_variable_values": []},
{"kg_variable_name": "KG_IGNORE_EMAIL_DOMAINS", "kg_variable_values": []},
{
"kg_variable_name": "KG_EXTRACTION_IN_PROGRESS",
"kg_variable_values": ["false"],
},
{
"kg_variable_name": "KG_CLUSTERING_IN_PROGRESS",
"kg_variable_values": ["false"],
},
{
"kg_variable_name": "KG_COVERAGE_START",
"kg_variable_values": [
(datetime.now() - timedelta(days=90)).strftime("%Y-%m-%d")
],
},
{"kg_variable_name": "KG_MAX_COVERAGE_DAYS", "kg_variable_values": ["90"]},
{
"kg_variable_name": "KG_MAX_PARENT_RECURSION_DEPTH",
"kg_variable_values": ["2"],
},
],
)
op.execute("DROP TABLE IF EXISTS kg_entity_type CASCADE")
op.create_table(
"kg_entity_type",
sa.Column("id_name", sa.String(), primary_key=True, nullable=False, index=True),
sa.Column("description", sa.String(), nullable=True),
sa.Column("grounding", sa.String(), nullable=False),
sa.Column(
"attributes",
postgresql.JSONB,
nullable=False,
server_default="{}",
),
sa.Column("occurrences", sa.Integer(), server_default="1", nullable=False),
sa.Column("active", sa.Boolean(), nullable=False, default=False),
sa.Column("deep_extraction", sa.Boolean(), nullable=False, default=False),
sa.Column(
"time_updated",
sa.DateTime(timezone=True),
server_default=sa.text("now()"),
onupdate=sa.text("now()"),
),
sa.Column(
"time_created", sa.DateTime(timezone=True), server_default=sa.text("now()")
),
sa.Column("grounded_source_name", sa.String(), nullable=True),
sa.Column("entity_values", postgresql.ARRAY(sa.String()), nullable=True),
sa.Column(
"clustering",
postgresql.JSONB,
nullable=False,
server_default="{}",
),
)
op.execute("DROP TABLE IF EXISTS kg_relationship_type CASCADE")
# Create KGRelationshipType table
op.create_table(
"kg_relationship_type",
sa.Column("id_name", sa.String(), primary_key=True, nullable=False, index=True),
sa.Column("name", sa.String(), nullable=False, index=True),
sa.Column(
"source_entity_type_id_name", sa.String(), nullable=False, index=True
),
sa.Column(
"target_entity_type_id_name", sa.String(), nullable=False, index=True
),
sa.Column("definition", sa.Boolean(), nullable=False, default=False),
sa.Column("occurrences", sa.Integer(), server_default="1", nullable=False),
sa.Column("type", sa.String(), nullable=False, index=True),
sa.Column("active", sa.Boolean(), nullable=False, default=True),
sa.Column(
"time_updated",
sa.DateTime(timezone=True),
server_default=sa.text("now()"),
onupdate=sa.text("now()"),
),
sa.Column(
"time_created", sa.DateTime(timezone=True), server_default=sa.text("now()")
),
sa.Column(
"clustering",
postgresql.JSONB,
nullable=False,
server_default="{}",
),
sa.ForeignKeyConstraint(
["source_entity_type_id_name"], ["kg_entity_type.id_name"]
),
sa.ForeignKeyConstraint(
["target_entity_type_id_name"], ["kg_entity_type.id_name"]
),
)
op.execute("DROP TABLE IF EXISTS kg_relationship_type_extraction_staging CASCADE")
# Create KGRelationshipTypeExtractionStaging table
op.create_table(
"kg_relationship_type_extraction_staging",
sa.Column("id_name", sa.String(), primary_key=True, nullable=False, index=True),
sa.Column("name", sa.String(), nullable=False, index=True),
sa.Column(
"source_entity_type_id_name", sa.String(), nullable=False, index=True
),
sa.Column(
"target_entity_type_id_name", sa.String(), nullable=False, index=True
),
sa.Column("definition", sa.Boolean(), nullable=False, default=False),
sa.Column("occurrences", sa.Integer(), server_default="1", nullable=False),
sa.Column("type", sa.String(), nullable=False, index=True),
sa.Column("active", sa.Boolean(), nullable=False, default=True),
sa.Column(
"time_created", sa.DateTime(timezone=True), server_default=sa.text("now()")
),
sa.Column(
"clustering",
postgresql.JSONB,
nullable=False,
server_default="{}",
),
sa.Column("transferred", sa.Boolean(), nullable=False, server_default="false"),
sa.ForeignKeyConstraint(
["source_entity_type_id_name"], ["kg_entity_type.id_name"]
),
sa.ForeignKeyConstraint(
["target_entity_type_id_name"], ["kg_entity_type.id_name"]
),
)
op.execute("DROP TABLE IF EXISTS kg_entity CASCADE")
# Create KGEntity table
op.create_table(
"kg_entity",
sa.Column("id_name", sa.String(), primary_key=True, nullable=False, index=True),
sa.Column("name", sa.String(), nullable=False, index=True),
sa.Column("entity_class", sa.String(), nullable=True, index=True),
sa.Column("entity_subtype", sa.String(), nullable=True, index=True),
sa.Column("entity_key", sa.String(), nullable=True, index=True),
sa.Column("name_trigrams", postgresql.ARRAY(sa.String(3)), nullable=True),
sa.Column("document_id", sa.String(), nullable=True, index=True),
sa.Column(
"alternative_names",
postgresql.ARRAY(sa.String()),
nullable=False,
server_default="{}",
),
sa.Column("entity_type_id_name", sa.String(), nullable=False, index=True),
sa.Column("description", sa.String(), nullable=True),
sa.Column(
"keywords",
postgresql.ARRAY(sa.String()),
nullable=False,
server_default="{}",
),
sa.Column("occurrences", sa.Integer(), server_default="1", nullable=False),
sa.Column(
"acl", postgresql.ARRAY(sa.String()), nullable=False, server_default="{}"
),
sa.Column("boosts", postgresql.JSONB, nullable=False, server_default="{}"),
sa.Column("attributes", postgresql.JSONB, nullable=False, server_default="{}"),
sa.Column("event_time", sa.DateTime(timezone=True), nullable=True),
sa.Column(
"time_updated",
sa.DateTime(timezone=True),
server_default=sa.text("now()"),
onupdate=sa.text("now()"),
),
sa.Column(
"time_created", sa.DateTime(timezone=True), server_default=sa.text("now()")
),
sa.ForeignKeyConstraint(["entity_type_id_name"], ["kg_entity_type.id_name"]),
sa.ForeignKeyConstraint(["document_id"], ["document.id"]),
sa.UniqueConstraint(
"name",
"entity_type_id_name",
"document_id",
name="uq_kg_entity_name_type_doc",
),
)
op.create_index("ix_entity_type_acl", "kg_entity", ["entity_type_id_name", "acl"])
op.create_index(
"ix_entity_name_search", "kg_entity", ["name", "entity_type_id_name"]
)
op.execute("DROP TABLE IF EXISTS kg_entity_extraction_staging CASCADE")
# Create KGEntityExtractionStaging table
op.create_table(
"kg_entity_extraction_staging",
sa.Column("id_name", sa.String(), primary_key=True, nullable=False, index=True),
sa.Column("name", sa.String(), nullable=False, index=True),
sa.Column("document_id", sa.String(), nullable=True, index=True),
sa.Column(
"alternative_names",
postgresql.ARRAY(sa.String()),
nullable=False,
server_default="{}",
),
sa.Column("entity_type_id_name", sa.String(), nullable=False, index=True),
sa.Column("description", sa.String(), nullable=True),
sa.Column(
"keywords",
postgresql.ARRAY(sa.String()),
nullable=False,
server_default="{}",
),
sa.Column("occurrences", sa.Integer(), server_default="1", nullable=False),
sa.Column(
"acl", postgresql.ARRAY(sa.String()), nullable=False, server_default="{}"
),
sa.Column("boosts", postgresql.JSONB, nullable=False, server_default="{}"),
sa.Column("attributes", postgresql.JSONB, nullable=False, server_default="{}"),
sa.Column("transferred_id_name", sa.String(), nullable=True, default=None),
sa.Column("entity_class", sa.String(), nullable=True, index=True),
sa.Column("entity_key", sa.String(), nullable=True, index=True),
sa.Column("entity_subtype", sa.String(), nullable=True, index=True),
sa.Column("parent_key", sa.String(), nullable=True, index=True),
sa.Column("event_time", sa.DateTime(timezone=True), nullable=True),
sa.Column(
"time_created", sa.DateTime(timezone=True), server_default=sa.text("now()")
),
sa.ForeignKeyConstraint(["entity_type_id_name"], ["kg_entity_type.id_name"]),
sa.ForeignKeyConstraint(["document_id"], ["document.id"]),
)
op.create_index(
"ix_entity_extraction_staging_acl",
"kg_entity_extraction_staging",
["entity_type_id_name", "acl"],
)
op.create_index(
"ix_entity_extraction_staging_name_search",
"kg_entity_extraction_staging",
["name", "entity_type_id_name"],
)
op.execute("DROP TABLE IF EXISTS kg_relationship CASCADE")
# Create KGRelationship table
op.create_table(
"kg_relationship",
sa.Column("id_name", sa.String(), nullable=False, index=True),
sa.Column("source_node", sa.String(), nullable=False, index=True),
sa.Column("target_node", sa.String(), nullable=False, index=True),
sa.Column("source_node_type", sa.String(), nullable=False, index=True),
sa.Column("target_node_type", sa.String(), nullable=False, index=True),
sa.Column("source_document", sa.String(), nullable=True, index=True),
sa.Column("type", sa.String(), nullable=False, index=True),
sa.Column("relationship_type_id_name", sa.String(), nullable=False, index=True),
sa.Column("occurrences", sa.Integer(), server_default="1", nullable=False),
sa.Column(
"time_updated",
sa.DateTime(timezone=True),
server_default=sa.text("now()"),
onupdate=sa.text("now()"),
),
sa.Column(
"time_created", sa.DateTime(timezone=True), server_default=sa.text("now()")
),
sa.ForeignKeyConstraint(["source_node"], ["kg_entity.id_name"]),
sa.ForeignKeyConstraint(["target_node"], ["kg_entity.id_name"]),
sa.ForeignKeyConstraint(["source_node_type"], ["kg_entity_type.id_name"]),
sa.ForeignKeyConstraint(["target_node_type"], ["kg_entity_type.id_name"]),
sa.ForeignKeyConstraint(["source_document"], ["document.id"]),
sa.ForeignKeyConstraint(
["relationship_type_id_name"], ["kg_relationship_type.id_name"]
),
sa.UniqueConstraint(
"source_node",
"target_node",
"type",
name="uq_kg_relationship_source_target_type",
),
sa.PrimaryKeyConstraint("id_name", "source_document"),
)
op.create_index(
"ix_kg_relationship_nodes", "kg_relationship", ["source_node", "target_node"]
)
op.execute("DROP TABLE IF EXISTS kg_relationship_extraction_staging CASCADE")
# Create KGRelationshipExtractionStaging table
op.create_table(
"kg_relationship_extraction_staging",
sa.Column("id_name", sa.String(), nullable=False, index=True),
sa.Column("source_node", sa.String(), nullable=False, index=True),
sa.Column("target_node", sa.String(), nullable=False, index=True),
sa.Column("source_node_type", sa.String(), nullable=False, index=True),
sa.Column("target_node_type", sa.String(), nullable=False, index=True),
sa.Column("source_document", sa.String(), nullable=True, index=True),
sa.Column("type", sa.String(), nullable=False, index=True),
sa.Column("relationship_type_id_name", sa.String(), nullable=False, index=True),
sa.Column("occurrences", sa.Integer(), server_default="1", nullable=False),
sa.Column("transferred", sa.Boolean(), nullable=False, server_default="false"),
sa.Column(
"time_created", sa.DateTime(timezone=True), server_default=sa.text("now()")
),
sa.ForeignKeyConstraint(
["source_node"], ["kg_entity_extraction_staging.id_name"]
),
sa.ForeignKeyConstraint(
["target_node"], ["kg_entity_extraction_staging.id_name"]
),
sa.ForeignKeyConstraint(["source_node_type"], ["kg_entity_type.id_name"]),
sa.ForeignKeyConstraint(["target_node_type"], ["kg_entity_type.id_name"]),
sa.ForeignKeyConstraint(["source_document"], ["document.id"]),
sa.ForeignKeyConstraint(
["relationship_type_id_name"],
["kg_relationship_type_extraction_staging.id_name"],
),
sa.UniqueConstraint(
"source_node",
"target_node",
"type",
name="uq_kg_relationship_extraction_staging_source_target_type",
),
sa.PrimaryKeyConstraint("id_name", "source_document"),
)
op.create_index(
"ix_kg_relationship_extraction_staging_nodes",
"kg_relationship_extraction_staging",
["source_node", "target_node"],
)
op.execute("DROP TABLE IF EXISTS kg_term CASCADE")
# Create KGTerm table
op.create_table(
"kg_term",
sa.Column("id_term", sa.String(), primary_key=True, nullable=False, index=True),
sa.Column(
"entity_types",
postgresql.ARRAY(sa.String()),
nullable=False,
server_default="{}",
),
sa.Column(
"time_updated",
sa.DateTime(timezone=True),
server_default=sa.text("now()"),
onupdate=sa.text("now()"),
),
sa.Column(
"time_created", sa.DateTime(timezone=True), server_default=sa.text("now()")
),
)
op.create_index("ix_search_term_entities", "kg_term", ["entity_types"])
op.create_index("ix_search_term_term", "kg_term", ["id_term"])
op.add_column(
"document",
sa.Column("kg_stage", sa.String(), nullable=True, index=True),
)
op.add_column(
"document",
sa.Column("kg_processing_time", sa.DateTime(timezone=True), nullable=True),
)
op.add_column(
"connector",
sa.Column(
"kg_processing_enabled",
sa.Boolean(),
nullable=True,
server_default="false",
),
)
op.add_column(
"connector",
sa.Column(
"kg_coverage_days",
sa.Integer(),
nullable=True,
server_default=None,
),
)
# Create GIN index for clustering and normalization
op.execute(
"CREATE INDEX IF NOT EXISTS idx_kg_entity_clustering_trigrams "
f"ON kg_entity USING GIN (name {POSTGRES_DEFAULT_SCHEMA}.gin_trgm_ops)"
)
op.execute(
"CREATE INDEX IF NOT EXISTS idx_kg_entity_normalization_trigrams "
"ON kg_entity USING GIN (name_trigrams)"
)
# Create kg_entity trigger to update kg_entity.name and its trigrams
alphanum_pattern = r"[^a-z0-9]+"
truncate_length = 1000
function = "update_kg_entity_name"
op.execute(
text(
f"""
CREATE OR REPLACE FUNCTION {function}()
RETURNS TRIGGER AS $$
DECLARE
name text;
cleaned_name text;
BEGIN
-- Set name to semantic_id if document_id is not NULL
IF NEW.document_id IS NOT NULL THEN
SELECT lower(semantic_id) INTO name
FROM document
WHERE id = NEW.document_id;
ELSE
name = lower(NEW.name);
END IF;
-- Clean name and truncate if too long
cleaned_name = regexp_replace(
name,
'{alphanum_pattern}', '', 'g'
);
IF length(cleaned_name) > {truncate_length} THEN
cleaned_name = left(cleaned_name, {truncate_length});
END IF;
-- Set name and name trigrams
NEW.name = name;
NEW.name_trigrams = {POSTGRES_DEFAULT_SCHEMA}.show_trgm(cleaned_name);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
"""
)
)
trigger = f"{function}_trigger"
op.execute(f"DROP TRIGGER IF EXISTS {trigger} ON kg_entity")
op.execute(
f"""
CREATE TRIGGER {trigger}
BEFORE INSERT OR UPDATE OF name
ON kg_entity
FOR EACH ROW
EXECUTE FUNCTION {function}();
"""
)
# Create document trigger to propagate semantic_id changes into kg_entity.name and its trigrams
function = "update_kg_entity_name_from_doc"
op.execute(
text(
f"""
CREATE OR REPLACE FUNCTION {function}()
RETURNS TRIGGER AS $$
DECLARE
doc_name text;
cleaned_name text;
BEGIN
doc_name = lower(NEW.semantic_id);
-- Clean name and truncate if too long
cleaned_name = regexp_replace(
doc_name,
'{alphanum_pattern}', '', 'g'
);
IF length(cleaned_name) > {truncate_length} THEN
cleaned_name = left(cleaned_name, {truncate_length});
END IF;
-- Set name and name trigrams for all entities referencing this document
UPDATE kg_entity
SET
name = doc_name,
name_trigrams = {POSTGRES_DEFAULT_SCHEMA}.show_trgm(cleaned_name)
WHERE document_id = NEW.id;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
"""
)
)
trigger = f"{function}_trigger"
op.execute(f"DROP TRIGGER IF EXISTS {trigger} ON document")
op.execute(
f"""
CREATE TRIGGER {trigger}
AFTER UPDATE OF semantic_id
ON document
FOR EACH ROW
EXECUTE FUNCTION {function}();
"""
)
def downgrade() -> None:
# Drop all views that start with 'kg_'
op.execute(
"""
DO $$
DECLARE
view_name text;
BEGIN
FOR view_name IN
SELECT c.relname
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'v'
AND n.nspname = current_schema()
AND c.relname LIKE 'kg_relationships_with_access%'
LOOP
EXECUTE 'DROP VIEW IF EXISTS ' || quote_ident(view_name);
END LOOP;
END $$;
"""
)
op.execute(
"""
DO $$
DECLARE
view_name text;
BEGIN
FOR view_name IN
SELECT c.relname
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'v'
AND n.nspname = current_schema()
AND c.relname LIKE 'allowed_docs%'
LOOP
EXECUTE 'DROP VIEW IF EXISTS ' || quote_ident(view_name);
END LOOP;
END $$;
"""
)
for table, function in (
("kg_entity", "update_kg_entity_name"),
("document", "update_kg_entity_name_from_doc"),
):
op.execute(f"DROP TRIGGER IF EXISTS {function}_trigger ON {table}")
op.execute(f"DROP FUNCTION IF EXISTS {function}()")
# Drop index
op.execute("DROP INDEX IF EXISTS idx_kg_entity_clustering_trigrams")
op.execute("DROP INDEX IF EXISTS idx_kg_entity_normalization_trigrams")
# Drop tables in reverse order of creation to handle dependencies
op.drop_table("kg_term")
op.drop_table("kg_relationship")
op.drop_table("kg_entity")
op.drop_table("kg_relationship_type")
op.drop_table("kg_relationship_extraction_staging")
op.drop_table("kg_relationship_type_extraction_staging")
op.drop_table("kg_entity_extraction_staging")
op.drop_table("kg_entity_type")
op.drop_column("connector", "kg_processing_enabled")
op.drop_column("connector", "kg_coverage_days")
op.drop_column("document", "kg_stage")
op.drop_column("document", "kg_processing_time")
op.drop_table("kg_config")
# Revoke usage on current schema for the readonly user
op.execute(
text(
f"""
DO $$
BEGIN
IF EXISTS (SELECT FROM pg_catalog.pg_roles WHERE rolname = '{DB_READONLY_USER}') THEN
EXECUTE format('REVOKE ALL ON SCHEMA %I FROM %I', current_schema(), '{DB_READONLY_USER}');
END IF;
END
$$;
"""
)
)
if not MULTI_TENANT:
# Drop read-only db user here only in single tenant mode. For multi-tenant mode,
# the user is dropped in the alembic_tenants migration.
op.execute(
text(
f"""
DO $$
BEGIN
IF EXISTS (SELECT FROM pg_catalog.pg_roles WHERE rolname = '{DB_READONLY_USER}') THEN
-- First revoke all privileges from the database
EXECUTE format('REVOKE ALL ON DATABASE %I FROM %I', current_database(), '{DB_READONLY_USER}');
-- Then drop the user
EXECUTE format('DROP USER %I', '{DB_READONLY_USER}');
END IF;
END
$$;
"""
)
)
op.execute(text("DROP EXTENSION IF EXISTS pg_trgm"))

View File

@@ -0,0 +1,380 @@
"""merge_default_assistants_into_unified
Revision ID: 505c488f6662
Revises: d09fc20a3c66
Create Date: 2025-09-09 19:00:56.816626
"""
import json
from typing import Any
from typing import NamedTuple
from uuid import UUID
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "505c488f6662"
down_revision = "d09fc20a3c66"
branch_labels = None
depends_on = None
# Constants for the unified assistant
UNIFIED_ASSISTANT_NAME = "Assistant"
UNIFIED_ASSISTANT_DESCRIPTION = (
"Your AI assistant with search, web browsing, and image generation capabilities."
)
UNIFIED_ASSISTANT_NUM_CHUNKS = 25
UNIFIED_ASSISTANT_DISPLAY_PRIORITY = 0
UNIFIED_ASSISTANT_LLM_FILTER_EXTRACTION = True
UNIFIED_ASSISTANT_LLM_RELEVANCE_FILTER = False
UNIFIED_ASSISTANT_RECENCY_BIAS = "AUTO" # NOTE: needs to be capitalized
UNIFIED_ASSISTANT_CHUNKS_ABOVE = 0
UNIFIED_ASSISTANT_CHUNKS_BELOW = 0
UNIFIED_ASSISTANT_DATETIME_AWARE = True
# NOTE: tool specific prompts are handled on the fly and automatically injected
# into the prompt before passing to the LLM.
DEFAULT_SYSTEM_PROMPT = """
You are a highly capable, thoughtful, and precise assistant. Your goal is to deeply understand the \
user's intent, ask clarifying questions when needed, think step-by-step through complex problems, \
provide clear and accurate answers, and proactively anticipate helpful follow-up information. Always \
prioritize being truthful, nuanced, insightful, and efficient.
The current date is [[CURRENT_DATETIME]]
You use different text styles, bolding, emojis (sparingly), block quotes, and other formatting to make \
your responses more readable and engaging.
You use proper Markdown and LaTeX to format your responses for math, scientific, and chemical formulas, \
symbols, etc.: '$$\\n[expression]\\n$$' for standalone cases and '\\( [expression] \\)' when inline.
For code you prefer to use Markdown and specify the language.
You can use Markdown horizontal rules (---) to separate sections of your responses.
You can use Markdown tables to format your responses for data, lists, and other structured information.
""".strip()
INSERT_DICT: dict[str, Any] = {
"name": UNIFIED_ASSISTANT_NAME,
"description": UNIFIED_ASSISTANT_DESCRIPTION,
"system_prompt": DEFAULT_SYSTEM_PROMPT,
"num_chunks": UNIFIED_ASSISTANT_NUM_CHUNKS,
"display_priority": UNIFIED_ASSISTANT_DISPLAY_PRIORITY,
"llm_filter_extraction": UNIFIED_ASSISTANT_LLM_FILTER_EXTRACTION,
"llm_relevance_filter": UNIFIED_ASSISTANT_LLM_RELEVANCE_FILTER,
"recency_bias": UNIFIED_ASSISTANT_RECENCY_BIAS,
"chunks_above": UNIFIED_ASSISTANT_CHUNKS_ABOVE,
"chunks_below": UNIFIED_ASSISTANT_CHUNKS_BELOW,
"datetime_aware": UNIFIED_ASSISTANT_DATETIME_AWARE,
}
GENERAL_ASSISTANT_ID = -1
ART_ASSISTANT_ID = -3
class UserRow(NamedTuple):
"""Typed representation of user row from database query."""
id: UUID
chosen_assistants: list[int] | None
visible_assistants: list[int] | None
hidden_assistants: list[int] | None
pinned_assistants: list[int] | None
def upgrade() -> None:
conn = op.get_bind()
# Start transaction
conn.execute(sa.text("BEGIN"))
try:
# Step 1: Create or update the unified assistant (ID 0)
search_assistant = conn.execute(
sa.text("SELECT * FROM persona WHERE id = 0")
).fetchone()
if search_assistant:
# Update existing Search assistant to be the unified assistant
conn.execute(
sa.text(
"""
UPDATE persona
SET name = :name,
description = :description,
system_prompt = :system_prompt,
num_chunks = :num_chunks,
is_default_persona = true,
is_visible = true,
deleted = false,
display_priority = :display_priority,
llm_filter_extraction = :llm_filter_extraction,
llm_relevance_filter = :llm_relevance_filter,
recency_bias = :recency_bias,
chunks_above = :chunks_above,
chunks_below = :chunks_below,
datetime_aware = :datetime_aware,
starter_messages = null
WHERE id = 0
"""
),
INSERT_DICT,
)
else:
# Create new unified assistant with ID 0
conn.execute(
sa.text(
"""
INSERT INTO persona (
id, name, description, system_prompt, num_chunks,
is_default_persona, is_visible, deleted, display_priority,
llm_filter_extraction, llm_relevance_filter, recency_bias,
chunks_above, chunks_below, datetime_aware, starter_messages,
builtin_persona
) VALUES (
0, :name, :description, :system_prompt, :num_chunks,
true, true, false, :display_priority, :llm_filter_extraction,
:llm_relevance_filter, :recency_bias, :chunks_above, :chunks_below,
:datetime_aware, null, true
)
"""
),
INSERT_DICT,
)
# Step 2: Mark ALL builtin assistants as deleted (except the unified assistant ID 0)
conn.execute(
sa.text(
"""
UPDATE persona
SET deleted = true, is_visible = false, is_default_persona = false
WHERE builtin_persona = true AND id != 0
"""
)
)
# Step 3: Add all built-in tools to the unified assistant
# First, get the tool IDs for SearchTool, ImageGenerationTool, and WebSearchTool
search_tool = conn.execute(
sa.text("SELECT id FROM tool WHERE in_code_tool_id = 'SearchTool'")
).fetchone()
if not search_tool:
raise ValueError(
"SearchTool not found in database. Ensure tools migration has run first."
)
image_gen_tool = conn.execute(
sa.text("SELECT id FROM tool WHERE in_code_tool_id = 'ImageGenerationTool'")
).fetchone()
if not image_gen_tool:
raise ValueError(
"ImageGenerationTool not found in database. Ensure tools migration has run first."
)
# WebSearchTool is optional - may not be configured
web_search_tool = conn.execute(
sa.text("SELECT id FROM tool WHERE in_code_tool_id = 'WebSearchTool'")
).fetchone()
# Clear existing tool associations for persona 0
conn.execute(sa.text("DELETE FROM persona__tool WHERE persona_id = 0"))
# Add tools to the unified assistant
conn.execute(
sa.text(
"""
INSERT INTO persona__tool (persona_id, tool_id)
VALUES (0, :tool_id)
ON CONFLICT DO NOTHING
"""
),
{"tool_id": search_tool[0]},
)
conn.execute(
sa.text(
"""
INSERT INTO persona__tool (persona_id, tool_id)
VALUES (0, :tool_id)
ON CONFLICT DO NOTHING
"""
),
{"tool_id": image_gen_tool[0]},
)
if web_search_tool:
conn.execute(
sa.text(
"""
INSERT INTO persona__tool (persona_id, tool_id)
VALUES (0, :tool_id)
ON CONFLICT DO NOTHING
"""
),
{"tool_id": web_search_tool[0]},
)
# Step 4: Migrate existing chat sessions from all builtin assistants to unified assistant
conn.execute(
sa.text(
"""
UPDATE chat_session
SET persona_id = 0
WHERE persona_id IN (
SELECT id FROM persona WHERE builtin_persona = true AND id != 0
)
"""
)
)
# Step 5: Migrate user preferences - remove references to all builtin assistants
# First, get all builtin assistant IDs (except 0)
builtin_assistants_result = conn.execute(
sa.text(
"""
SELECT id FROM persona
WHERE builtin_persona = true AND id != 0
"""
)
).fetchall()
builtin_assistant_ids = [row[0] for row in builtin_assistants_result]
# Get all users with preferences
users_result = conn.execute(
sa.text(
"""
SELECT id, chosen_assistants, visible_assistants,
hidden_assistants, pinned_assistants
FROM "user"
"""
)
).fetchall()
for user_row in users_result:
user = UserRow(*user_row)
user_id: UUID = user.id
updates: dict[str, Any] = {}
# Remove all builtin assistants from chosen_assistants
if user.chosen_assistants:
new_chosen: list[int] = [
assistant_id
for assistant_id in user.chosen_assistants
if assistant_id not in builtin_assistant_ids
]
if new_chosen != user.chosen_assistants:
updates["chosen_assistants"] = json.dumps(new_chosen)
# Remove all builtin assistants from visible_assistants
if user.visible_assistants:
new_visible: list[int] = [
assistant_id
for assistant_id in user.visible_assistants
if assistant_id not in builtin_assistant_ids
]
if new_visible != user.visible_assistants:
updates["visible_assistants"] = json.dumps(new_visible)
# Add all builtin assistants to hidden_assistants
if user.hidden_assistants:
new_hidden: list[int] = list(user.hidden_assistants)
for old_id in builtin_assistant_ids:
if old_id not in new_hidden:
new_hidden.append(old_id)
if new_hidden != user.hidden_assistants:
updates["hidden_assistants"] = json.dumps(new_hidden)
else:
updates["hidden_assistants"] = json.dumps(builtin_assistant_ids)
# Remove all builtin assistants from pinned_assistants
if user.pinned_assistants:
new_pinned: list[int] = [
assistant_id
for assistant_id in user.pinned_assistants
if assistant_id not in builtin_assistant_ids
]
if new_pinned != user.pinned_assistants:
updates["pinned_assistants"] = json.dumps(new_pinned)
# Apply updates if any
if updates:
set_clause = ", ".join([f"{k} = :{k}" for k in updates.keys()])
updates["user_id"] = str(user_id) # Convert UUID to string for SQL
conn.execute(
sa.text(f'UPDATE "user" SET {set_clause} WHERE id = :user_id'),
updates,
)
# Commit transaction
conn.execute(sa.text("COMMIT"))
except Exception as e:
# Rollback on error
conn.execute(sa.text("ROLLBACK"))
raise e
def downgrade() -> None:
conn = op.get_bind()
# Start transaction
conn.execute(sa.text("BEGIN"))
try:
# Only restore General (ID -1) and Art (ID -3) assistants
# Step 1: Keep Search assistant (ID 0) as default but restore original state
conn.execute(
sa.text(
"""
UPDATE persona
SET is_default_persona = true,
is_visible = true,
deleted = false
WHERE id = 0
"""
)
)
# Step 2: Restore General assistant (ID -1)
conn.execute(
sa.text(
"""
UPDATE persona
SET deleted = false,
is_visible = true,
is_default_persona = true
WHERE id = :general_assistant_id
"""
),
{"general_assistant_id": GENERAL_ASSISTANT_ID},
)
# Step 3: Restore Art assistant (ID -3)
conn.execute(
sa.text(
"""
UPDATE persona
SET deleted = false,
is_visible = true,
is_default_persona = true
WHERE id = :art_assistant_id
"""
),
{"art_assistant_id": ART_ASSISTANT_ID},
)
# Note: We don't restore the original tool associations, names, or descriptions
# as those would require more complex logic to determine original state.
# We also cannot restore original chat session persona_ids as we don't
# have the original mappings.
# Other builtin assistants remain deleted as per the requirement.
# Commit transaction
conn.execute(sa.text("COMMIT"))
except Exception as e:
# Rollback on error
conn.execute(sa.text("ROLLBACK"))
raise e
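The per-user loop above only issues an UPDATE for columns whose lists actually changed. The pruning itself is plain list filtering, sketched here as a pure function (names are local to this sketch); returning None signals that no write is needed:

def prune_assistants(
    current: list[int] | None, builtin_ids: list[int]
) -> list[int] | None:
    """Drop builtin assistant ids; None means the column is unchanged."""
    if not current:
        return None
    pruned = [a for a in current if a not in builtin_ids]
    return pruned if pruned != current else None

# Builtin ids exclude the unified assistant (id 0), matching the query above.
assert prune_assistants([0, -1, 5], [-1, -3]) == [0, 5]
assert prune_assistants([0, 5], [-1, -3]) is None  # unchanged, so no UPDATE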

View File

@@ -0,0 +1,90 @@
"""add stale column to external user group tables
Revision ID: 58c50ef19f08
Revises: 7b9b952abdf6
Create Date: 2025-06-25 14:08:14.162380
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "58c50ef19f08"
down_revision = "7b9b952abdf6"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Add the stale column with default value False to user__external_user_group_id
op.add_column(
"user__external_user_group_id",
sa.Column("stale", sa.Boolean(), nullable=False, server_default="false"),
)
# Create index for efficient querying of stale rows by cc_pair_id
op.create_index(
"ix_user__external_user_group_id_cc_pair_id_stale",
"user__external_user_group_id",
["cc_pair_id", "stale"],
unique=False,
)
# Create index for efficient querying of all stale rows
op.create_index(
"ix_user__external_user_group_id_stale",
"user__external_user_group_id",
["stale"],
unique=False,
)
# Add the stale column with default value False to public_external_user_group
op.add_column(
"public_external_user_group",
sa.Column("stale", sa.Boolean(), nullable=False, server_default="false"),
)
# Create index for efficient querying of stale rows by cc_pair_id
op.create_index(
"ix_public_external_user_group_cc_pair_id_stale",
"public_external_user_group",
["cc_pair_id", "stale"],
unique=False,
)
# Create index for efficient querying of all stale rows
op.create_index(
"ix_public_external_user_group_stale",
"public_external_user_group",
["stale"],
unique=False,
)
def downgrade() -> None:
# Drop the indices for public_external_user_group first
op.drop_index(
"ix_public_external_user_group_stale", table_name="public_external_user_group"
)
op.drop_index(
"ix_public_external_user_group_cc_pair_id_stale",
table_name="public_external_user_group",
)
# Drop the stale column from public_external_user_group
op.drop_column("public_external_user_group", "stale")
# Drop the indices for user__external_user_group_id
op.drop_index(
"ix_user__external_user_group_id_stale",
table_name="user__external_user_group_id",
)
op.drop_index(
"ix_user__external_user_group_id_cc_pair_id_stale",
table_name="user__external_user_group_id",
)
# Drop the stale column from user__external_user_group_id
op.drop_column("user__external_user_group_id", "stale")

View File

@@ -0,0 +1,115 @@
"""add research agent database tables and chat message research fields
Revision ID: 5ae8240accb3
Revises: b558f51620b4
Create Date: 2025-08-06 14:29:24.691388
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
# revision identifiers, used by Alembic.
revision = "5ae8240accb3"
down_revision = "b558f51620b4"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Add research_type and research_plan columns to chat_message table
op.add_column(
"chat_message",
sa.Column("research_type", sa.String(), nullable=True),
)
op.add_column(
"chat_message",
sa.Column("research_plan", postgresql.JSONB(), nullable=True),
)
# Create research_agent_iteration table
op.create_table(
"research_agent_iteration",
sa.Column("id", sa.Integer(), autoincrement=True, nullable=False),
sa.Column(
"primary_question_id",
sa.Integer(),
sa.ForeignKey("chat_message.id", ondelete="CASCADE"),
nullable=False,
),
sa.Column("iteration_nr", sa.Integer(), nullable=False),
sa.Column(
"created_at",
sa.DateTime(timezone=True),
server_default=sa.func.now(),
nullable=False,
),
sa.Column("purpose", sa.String(), nullable=True),
sa.Column("reasoning", sa.String(), nullable=True),
sa.PrimaryKeyConstraint("id"),
sa.UniqueConstraint(
"primary_question_id",
"iteration_nr",
name="_research_agent_iteration_unique_constraint",
),
)
# Create research_agent_iteration_sub_step table
op.create_table(
"research_agent_iteration_sub_step",
sa.Column("id", sa.Integer(), autoincrement=True, nullable=False),
sa.Column(
"primary_question_id",
sa.Integer(),
sa.ForeignKey("chat_message.id", ondelete="CASCADE"),
nullable=False,
),
sa.Column(
"parent_question_id",
sa.Integer(),
sa.ForeignKey("research_agent_iteration_sub_step.id", ondelete="CASCADE"),
nullable=True,
),
sa.Column("iteration_nr", sa.Integer(), nullable=False),
sa.Column("iteration_sub_step_nr", sa.Integer(), nullable=False),
sa.Column(
"created_at",
sa.DateTime(timezone=True),
server_default=sa.func.now(),
nullable=False,
),
sa.Column("sub_step_instructions", sa.String(), nullable=True),
sa.Column(
"sub_step_tool_id",
sa.Integer(),
sa.ForeignKey("tool.id"),
nullable=True,
),
sa.Column("reasoning", sa.String(), nullable=True),
sa.Column("sub_answer", sa.String(), nullable=True),
sa.Column("cited_doc_results", postgresql.JSONB(), nullable=True),
sa.Column("claims", postgresql.JSONB(), nullable=True),
sa.Column("generated_images", postgresql.JSONB(), nullable=True),
sa.Column("additional_data", postgresql.JSONB(), nullable=True),
sa.PrimaryKeyConstraint("id"),
sa.ForeignKeyConstraint(
["primary_question_id", "iteration_nr"],
[
"research_agent_iteration.primary_question_id",
"research_agent_iteration.iteration_nr",
],
ondelete="CASCADE",
),
)
def downgrade() -> None:
# Drop tables in reverse order
op.drop_table("research_agent_iteration_sub_step")
op.drop_table("research_agent_iteration")
# Remove columns from chat_message table
op.drop_column("chat_message", "research_plan")
op.drop_column("chat_message", "research_type")

View File

@@ -0,0 +1,24 @@
"""Add content type to UserFile
Revision ID: 5c448911b12f
Revises: 47a07e1a38f1
Create Date: 2025-04-25 16:59:48.182672
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "5c448911b12f"
down_revision = "47a07e1a38f1"
branch_labels: None = None
depends_on: None = None
def upgrade() -> None:
op.add_column("user_file", sa.Column("content_type", sa.String(), nullable=True))
def downgrade() -> None:
op.drop_column("user_file", "content_type")

View File

@@ -0,0 +1,132 @@
"""add file names to file connector config
Revision ID: 62c3a055a141
Revises: 3fc5d75723b3
Create Date: 2025-07-30 17:01:24.417551
"""
from alembic import op
import sqlalchemy as sa
import json
import os
import logging
# revision identifiers, used by Alembic.
revision = "62c3a055a141"
down_revision = "3fc5d75723b3"
branch_labels = None
depends_on = None
SKIP_FILE_NAME_MIGRATION = (
os.environ.get("SKIP_FILE_NAME_MIGRATION", "true").lower() == "true"
)
logger = logging.getLogger("alembic.runtime.migration")
def upgrade() -> None:
if SKIP_FILE_NAME_MIGRATION:
logger.info(
"Skipping file name migration. Hint: set SKIP_FILE_NAME_MIGRATION=false to run this migration"
)
return
logger.info("Running file name migration")
# Get connection
conn = op.get_bind()
# Get all FILE connectors with their configs
file_connectors = conn.execute(
sa.text(
"""
SELECT id, connector_specific_config
FROM connector
WHERE source = 'FILE'
"""
)
).fetchall()
for connector_id, config in file_connectors:
# Parse config if it's a string
if isinstance(config, str):
config = json.loads(config)
# Get file_locations list
file_locations = config.get("file_locations", [])
# Get display names for each file_id
file_names = []
for file_id in file_locations:
result = conn.execute(
sa.text(
"""
SELECT display_name
FROM file_record
WHERE file_id = :file_id
"""
),
{"file_id": file_id},
).fetchone()
if result:
file_names.append(result[0])
else:
file_names.append(file_id) # Should not happen
# Add file_names to config
new_config = dict(config)
new_config["file_names"] = file_names
# Update the connector
conn.execute(
sa.text(
"""
UPDATE connector
SET connector_specific_config = :new_config
WHERE id = :connector_id
"""
),
{"connector_id": connector_id, "new_config": json.dumps(new_config)},
)
def downgrade() -> None:
# Get connection
conn = op.get_bind()
# Remove file_names from all FILE connectors
file_connectors = conn.execute(
sa.text(
"""
SELECT id, connector_specific_config
FROM connector
WHERE source = 'FILE'
"""
)
).fetchall()
for connector_id, config in file_connectors:
# Parse config if it's a string
if isinstance(config, str):
config = json.loads(config)
# Remove file_names if it exists
if "file_names" in config:
new_config = dict(config)
del new_config["file_names"]
# Update the connector
conn.execute(
sa.text(
"""
UPDATE connector
SET connector_specific_config = :new_config
WHERE id = :connector_id
"""
),
{
"connector_id": connector_id,
"new_config": json.dumps(new_config),
},
)
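
This backfill is skipped by default; as the log hint says, it only runs when SKIP_FILE_NAME_MIGRATION=false is set in the environment of the process applying the migration. A small sketch of what the per-connector config transformation looks like, with a hypothetical display-name lookup standing in for the file_record query:

    # Illustration only: the same file_names backfill on an in-memory config.
    display_names = {"abc123": "quarterly_report.pdf", "def456": "notes.txt"}

    config = {"file_locations": ["abc123", "def456"]}
    # Fall back to the raw file_id when no display name is found, as the migration does.
    file_names = [display_names.get(fid, fid) for fid in config["file_locations"]]

    new_config = dict(config)
    new_config["file_names"] = file_names
    # -> {"file_locations": ["abc123", "def456"],
    #     "file_names": ["quarterly_report.pdf", "notes.txt"]}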

View File

@@ -0,0 +1,37 @@
"""Add image input support to model config
Revision ID: 64bd5677aeb6
Revises: b30353be4eec
Create Date: 2025-09-28 15:48:12.003612
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "64bd5677aeb6"
down_revision = "b30353be4eec"
branch_labels = None
depends_on = None
def upgrade() -> None:
op.add_column(
"model_configuration",
sa.Column("supports_image_input", sa.Boolean(), nullable=True),
)
# Seems to be left over from when model visibility was introduced as a nullable field.
# Set any null is_visible values to False
connection = op.get_bind()
connection.execute(
sa.text(
"UPDATE model_configuration SET is_visible = false WHERE is_visible IS NULL"
)
)
def downgrade() -> None:
op.drop_column("model_configuration", "supports_image_input")

View File

@@ -0,0 +1,41 @@
"""remove kg subtype from db
Revision ID: 65bc6e0f8500
Revises: cec7ec36c505
Create Date: 2025-06-13 10:04:27.705976
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "65bc6e0f8500"
down_revision = "cec7ec36c505"
branch_labels = None
depends_on = None
def upgrade() -> None:
op.drop_column("kg_entity", "entity_class")
op.drop_column("kg_entity", "entity_subtype")
op.drop_column("kg_entity_extraction_staging", "entity_class")
op.drop_column("kg_entity_extraction_staging", "entity_subtype")
def downgrade() -> None:
op.add_column(
"kg_entity_extraction_staging",
sa.Column("entity_subtype", sa.String(), nullable=True, index=True),
)
op.add_column(
"kg_entity_extraction_staging",
sa.Column("entity_class", sa.String(), nullable=True, index=True),
)
op.add_column(
"kg_entity", sa.Column("entity_subtype", sa.String(), nullable=True, index=True)
)
op.add_column(
"kg_entity", sa.Column("entity_class", sa.String(), nullable=True, index=True)
)

View File

@@ -6,12 +6,6 @@ Create Date: 2025-04-01 07:26:10.539362
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy import inspect
import datetime
# revision identifiers, used by Alembic.
revision = "6a804aeb4830"
down_revision = "8e1ac4f39a9f"
@@ -19,99 +13,10 @@ branch_labels = None
depends_on = None
# Leaving this around only because some people might be on this migration;
# it was originally a duplicate of the user files migration
def upgrade() -> None:
# Check if user_file table already exists
conn = op.get_bind()
inspector = inspect(conn)
if not inspector.has_table("user_file"):
# Create user_folder table without parent_id
op.create_table(
"user_folder",
sa.Column("id", sa.Integer(), primary_key=True, autoincrement=True),
sa.Column("user_id", sa.UUID(), sa.ForeignKey("user.id"), nullable=True),
sa.Column("name", sa.String(length=255), nullable=True),
sa.Column("description", sa.String(length=255), nullable=True),
sa.Column("display_priority", sa.Integer(), nullable=True, default=0),
sa.Column(
"created_at", sa.DateTime(timezone=True), server_default=sa.func.now()
),
)
# Create user_file table with folder_id instead of parent_folder_id
op.create_table(
"user_file",
sa.Column("id", sa.Integer(), primary_key=True, autoincrement=True),
sa.Column("user_id", sa.UUID(), sa.ForeignKey("user.id"), nullable=True),
sa.Column(
"folder_id",
sa.Integer(),
sa.ForeignKey("user_folder.id"),
nullable=True,
),
sa.Column("link_url", sa.String(), nullable=True),
sa.Column("token_count", sa.Integer(), nullable=True),
sa.Column("file_type", sa.String(), nullable=True),
sa.Column("file_id", sa.String(length=255), nullable=False),
sa.Column("document_id", sa.String(length=255), nullable=False),
sa.Column("name", sa.String(length=255), nullable=False),
sa.Column(
"created_at",
sa.DateTime(),
default=datetime.datetime.utcnow,
),
sa.Column(
"cc_pair_id",
sa.Integer(),
sa.ForeignKey("connector_credential_pair.id"),
nullable=True,
unique=True,
),
)
# Create persona__user_file table
op.create_table(
"persona__user_file",
sa.Column(
"persona_id",
sa.Integer(),
sa.ForeignKey("persona.id"),
primary_key=True,
),
sa.Column(
"user_file_id",
sa.Integer(),
sa.ForeignKey("user_file.id"),
primary_key=True,
),
)
# Create persona__user_folder table
op.create_table(
"persona__user_folder",
sa.Column(
"persona_id",
sa.Integer(),
sa.ForeignKey("persona.id"),
primary_key=True,
),
sa.Column(
"user_folder_id",
sa.Integer(),
sa.ForeignKey("user_folder.id"),
primary_key=True,
),
)
op.add_column(
"connector_credential_pair",
sa.Column("is_user_file", sa.Boolean(), nullable=True, default=False),
)
# Update existing records to have is_user_file=False instead of NULL
op.execute(
"UPDATE connector_credential_pair SET is_user_file = FALSE WHERE is_user_file IS NULL"
)
pass
def downgrade() -> None:

View File

@@ -0,0 +1,37 @@
"""add queries and is web fetch to iteration answer
Revision ID: 6f4f86aef280
Revises: 03d710ccf29c
Create Date: 2025-10-14 18:08:30.920123
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
# revision identifiers, used by Alembic.
revision = "6f4f86aef280"
down_revision = "03d710ccf29c"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Add is_web_fetch column
op.add_column(
"research_agent_iteration_sub_step",
sa.Column("is_web_fetch", sa.Boolean(), nullable=True),
)
# Add queries column
op.add_column(
"research_agent_iteration_sub_step",
sa.Column("queries", postgresql.JSONB(), nullable=True),
)
def downgrade() -> None:
op.drop_column("research_agent_iteration_sub_step", "queries")
op.drop_column("research_agent_iteration_sub_step", "is_web_fetch")

View File

@@ -6,11 +6,8 @@ Create Date: 2024-04-15 01:36:02.952809
"""
import json
from typing import cast
from alembic import op
import sqlalchemy as sa
from onyx.key_value_store.factory import get_kv_store
# revision identifiers, used by Alembic.
revision = "703313b75876"
@@ -54,27 +51,10 @@ def upgrade() -> None:
sa.PrimaryKeyConstraint("rate_limit_id", "user_group_id"),
)
try:
settings_json = cast(str, get_kv_store().load("token_budget_settings"))
settings = json.loads(settings_json)
is_enabled = settings.get("enable_token_budget", False)
token_budget = settings.get("token_budget", -1)
period_hours = settings.get("period_hours", -1)
if is_enabled and token_budget > 0 and period_hours > 0:
op.execute(
f"INSERT INTO token_rate_limit \
(enabled, token_budget, period_hours, scope) VALUES \
({is_enabled}, {token_budget}, {period_hours}, 'GLOBAL')"
)
# Delete the dynamic config
get_kv_store().delete("token_budget_settings")
except Exception:
# Ignore if the dynamic config is not found
pass
# NOTE: rate limit settings used to be stored in the "token_budget_settings" key in the
# KeyValueStore. This will now be lost. The KV store works differently than it used to
# so the migration is fairly complicated and likely not worth it to support (pretty much
# nobody will have it set)
def downgrade() -> None:

View File

@@ -0,0 +1,237 @@
"""Add model-configuration table
Revision ID: 7a70b7664e37
Revises: d961aca62eb3
Create Date: 2025-04-10 15:00:35.984669
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
from onyx.llm.llm_provider_options import (
fetch_model_names_for_provider_as_set,
fetch_visible_model_names_for_provider_as_set,
)
# revision identifiers, used by Alembic.
revision = "7a70b7664e37"
down_revision = "d961aca62eb3"
branch_labels = None
depends_on = None
def _resolve(
provider_name: str,
model_names: list[str] | None,
display_model_names: list[str] | None,
default_model_name: str,
fast_default_model_name: str | None,
) -> set[tuple[str, bool]]:
models = set(model_names) if model_names else None
display_models = set(display_model_names) if display_model_names else None
# If both are defined, we need to make sure that `model_names` is a superset of `display_model_names`.
if models and display_models:
models = display_models.union(models)
# If only `model_names` is defined, then:
# - If default-model-names are available for the `provider_name`, then set `display_model_names` to it
# and set `model_names` to the union of those default-model-names with itself.
# - If no default-model-names are available, then set `display_models` to `models`.
#
# This preserves the invariant that `display_models` is a subset of `models`.
elif models and not display_models:
visible_default_models = fetch_visible_model_names_for_provider_as_set(
provider_name=provider_name
)
if visible_default_models:
display_models = set(visible_default_models)
models = display_models.union(models)
else:
display_models = set(models)
# If only the `display_model_names` are defined, then set `models` to the union of `display_model_names`
# and the default-model-names for that provider.
#
# This will also preserve the invariant that `display_models` is a subset of `models`.
elif not models and display_models:
default_models = fetch_model_names_for_provider_as_set(
provider_name=provider_name
)
if default_models:
models = display_models.union(default_models)
else:
models = set(display_models)
# If neither are defined, then set `models` and `display_models` to the default-model-names for the given provider.
#
# This will also preserve the invariant that `display_models` is a subset of `models`.
else:
default_models = fetch_model_names_for_provider_as_set(
provider_name=provider_name
)
visible_default_models = fetch_visible_model_names_for_provider_as_set(
provider_name=provider_name
)
if default_models:
if not visible_default_models:
raise RuntimeError(
"If `default_models` is non-None, `visible_default_models` must be non-None too."
)
models = default_models
display_models = visible_default_models
# This is not a well-known llm-provider; we can't provide any model suggestions.
# Therefore, we set to the empty set and continue
else:
models = set()
display_models = set()
# It is possible that `default_model_name` is not in `models` and is not in `display_models`.
# It is also possible that `fast_default_model_name` is not in `models` and is not in `display_models`.
models.add(default_model_name)
if fast_default_model_name:
models.add(fast_default_model_name)
display_models.add(default_model_name)
if fast_default_model_name:
display_models.add(fast_default_model_name)
return set([(model, model in display_models) for model in models])
def upgrade() -> None:
op.create_table(
"model_configuration",
sa.Column("id", sa.Integer(), nullable=False),
sa.Column("llm_provider_id", sa.Integer(), nullable=False),
sa.Column("name", sa.String(), nullable=False),
sa.Column("is_visible", sa.Boolean(), nullable=False),
sa.Column("max_input_tokens", sa.Integer(), nullable=True),
sa.ForeignKeyConstraint(
["llm_provider_id"], ["llm_provider.id"], ondelete="CASCADE"
),
sa.PrimaryKeyConstraint("id"),
sa.UniqueConstraint("llm_provider_id", "name"),
)
# Create temporary sqlalchemy references to tables for data migration
llm_provider_table = sa.sql.table(
"llm_provider",
sa.column("id", sa.Integer),
sa.column("provider", sa.Integer),
sa.column("model_names", postgresql.ARRAY(sa.String)),
sa.column("display_model_names", postgresql.ARRAY(sa.String)),
sa.column("default_model_name", sa.String),
sa.column("fast_default_model_name", sa.String),
)
model_configuration_table = sa.sql.table(
"model_configuration",
sa.column("id", sa.Integer),
sa.column("llm_provider_id", sa.Integer),
sa.column("name", sa.String),
sa.column("is_visible", sa.Boolean),
sa.column("max_input_tokens", sa.Integer),
)
connection = op.get_bind()
llm_providers = connection.execute(
sa.select(
llm_provider_table.c.id,
llm_provider_table.c.provider,
llm_provider_table.c.model_names,
llm_provider_table.c.display_model_names,
llm_provider_table.c.default_model_name,
llm_provider_table.c.fast_default_model_name,
)
).fetchall()
for llm_provider in llm_providers:
provider_id = llm_provider[0]
provider_name = llm_provider[1]
model_names = llm_provider[2]
display_model_names = llm_provider[3]
default_model_name = llm_provider[4]
fast_default_model_name = llm_provider[5]
model_configurations = _resolve(
provider_name=provider_name,
model_names=model_names,
display_model_names=display_model_names,
default_model_name=default_model_name,
fast_default_model_name=fast_default_model_name,
)
for model_name, is_visible in model_configurations:
connection.execute(
model_configuration_table.insert().values(
llm_provider_id=provider_id,
name=model_name,
is_visible=is_visible,
max_input_tokens=None,
)
)
op.drop_column("llm_provider", "model_names")
op.drop_column("llm_provider", "display_model_names")
def downgrade() -> None:
llm_provider = sa.table(
"llm_provider",
sa.column("id", sa.Integer),
sa.column("model_names", postgresql.ARRAY(sa.String)),
sa.column("display_model_names", postgresql.ARRAY(sa.String)),
)
model_configuration = sa.table(
"model_configuration",
sa.column("id", sa.Integer),
sa.column("llm_provider_id", sa.Integer),
sa.column("name", sa.String),
sa.column("is_visible", sa.Boolean),
sa.column("max_input_tokens", sa.Integer),
)
op.add_column(
"llm_provider",
sa.Column(
"model_names",
postgresql.ARRAY(sa.VARCHAR()),
autoincrement=False,
nullable=True,
),
)
op.add_column(
"llm_provider",
sa.Column(
"display_model_names",
postgresql.ARRAY(sa.VARCHAR()),
autoincrement=False,
nullable=True,
),
)
connection = op.get_bind()
provider_ids = connection.execute(sa.select(llm_provider.c.id)).fetchall()
for (provider_id,) in provider_ids:
# Get all models for this provider
models = connection.execute(
sa.select(
model_configuration.c.name, model_configuration.c.is_visible
).where(model_configuration.c.llm_provider_id == provider_id)
).fetchall()
all_models = [model[0] for model in models]
visible_models = [model[0] for model in models if model[1]]
# Update provider with arrays
op.execute(
llm_provider.update()
.where(llm_provider.c.id == provider_id)
.values(model_names=all_models, display_model_names=visible_models)
)
op.drop_table("model_configuration")

View File

@@ -0,0 +1,318 @@
"""update-entities
Revision ID: 7b9b952abdf6
Revises: 36e9220ab794
Create Date: 2025-06-23 20:24:08.139201
"""
import json
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "7b9b952abdf6"
down_revision = "36e9220ab794"
branch_labels = None
depends_on = None
def upgrade() -> None:
conn = op.get_bind()
# new entity type metadata_attribute_conversion
new_entity_type_conversion = {
"LINEAR": {
"team": {"name": "team", "keep": True, "implication_property": None},
"state": {"name": "state", "keep": True, "implication_property": None},
"priority": {
"name": "priority",
"keep": True,
"implication_property": None,
},
"estimate": {
"name": "estimate",
"keep": True,
"implication_property": None,
},
"created_at": {
"name": "created_at",
"keep": True,
"implication_property": None,
},
"started_at": {
"name": "started_at",
"keep": True,
"implication_property": None,
},
"completed_at": {
"name": "completed_at",
"keep": True,
"implication_property": None,
},
"due_date": {
"name": "due_date",
"keep": True,
"implication_property": None,
},
"creator": {
"name": "creator",
"keep": False,
"implication_property": {
"implied_entity_type": "from_email",
"implied_relationship_name": "is_creator_of",
},
},
"assignee": {
"name": "assignee",
"keep": False,
"implication_property": {
"implied_entity_type": "from_email",
"implied_relationship_name": "is_assignee_of",
},
},
},
"JIRA": {
"issuetype": {
"name": "subtype",
"keep": True,
"implication_property": None,
},
"status": {"name": "status", "keep": True, "implication_property": None},
"priority": {
"name": "priority",
"keep": True,
"implication_property": None,
},
"project_name": {
"name": "project",
"keep": True,
"implication_property": None,
},
"created": {
"name": "created_at",
"keep": True,
"implication_property": None,
},
"updated": {
"name": "updated_at",
"keep": True,
"implication_property": None,
},
"resolution_date": {
"name": "completed_at",
"keep": True,
"implication_property": None,
},
"duedate": {"name": "due_date", "keep": True, "implication_property": None},
"reporter_email": {
"name": "creator",
"keep": False,
"implication_property": {
"implied_entity_type": "from_email",
"implied_relationship_name": "is_creator_of",
},
},
"assignee_email": {
"name": "assignee",
"keep": False,
"implication_property": {
"implied_entity_type": "from_email",
"implied_relationship_name": "is_assignee_of",
},
},
"key": {"name": "key", "keep": True, "implication_property": None},
"parent": {"name": "parent", "keep": True, "implication_property": None},
},
"GITHUB_PR": {
"repo": {"name": "repository", "keep": True, "implication_property": None},
"state": {"name": "state", "keep": True, "implication_property": None},
"num_commits": {
"name": "num_commits",
"keep": True,
"implication_property": None,
},
"num_files_changed": {
"name": "num_files_changed",
"keep": True,
"implication_property": None,
},
"labels": {"name": "labels", "keep": True, "implication_property": None},
"merged": {"name": "merged", "keep": True, "implication_property": None},
"merged_at": {
"name": "merged_at",
"keep": True,
"implication_property": None,
},
"closed_at": {
"name": "closed_at",
"keep": True,
"implication_property": None,
},
"created_at": {
"name": "created_at",
"keep": True,
"implication_property": None,
},
"updated_at": {
"name": "updated_at",
"keep": True,
"implication_property": None,
},
"user": {
"name": "creator",
"keep": False,
"implication_property": {
"implied_entity_type": "from_email",
"implied_relationship_name": "is_creator_of",
},
},
"assignees": {
"name": "assignees",
"keep": False,
"implication_property": {
"implied_entity_type": "from_email",
"implied_relationship_name": "is_assignee_of",
},
},
},
"GITHUB_ISSUE": {
"repo": {"name": "repository", "keep": True, "implication_property": None},
"state": {"name": "state", "keep": True, "implication_property": None},
"labels": {"name": "labels", "keep": True, "implication_property": None},
"closed_at": {
"name": "closed_at",
"keep": True,
"implication_property": None,
},
"created_at": {
"name": "created_at",
"keep": True,
"implication_property": None,
},
"updated_at": {
"name": "updated_at",
"keep": True,
"implication_property": None,
},
"user": {
"name": "creator",
"keep": False,
"implication_property": {
"implied_entity_type": "from_email",
"implied_relationship_name": "is_creator_of",
},
},
"assignees": {
"name": "assignees",
"keep": False,
"implication_property": {
"implied_entity_type": "from_email",
"implied_relationship_name": "is_assignee_of",
},
},
},
"FIREFLIES": {},
"ACCOUNT": {},
"OPPORTUNITY": {
"name": {"name": "name", "keep": True, "implication_property": None},
"stage_name": {"name": "stage", "keep": True, "implication_property": None},
"type": {"name": "type", "keep": True, "implication_property": None},
"amount": {"name": "amount", "keep": True, "implication_property": None},
"fiscal_year": {
"name": "fiscal_year",
"keep": True,
"implication_property": None,
},
"fiscal_quarter": {
"name": "fiscal_quarter",
"keep": True,
"implication_property": None,
},
"is_closed": {
"name": "is_closed",
"keep": True,
"implication_property": None,
},
"close_date": {
"name": "close_date",
"keep": True,
"implication_property": None,
},
"probability": {
"name": "close_probability",
"keep": True,
"implication_property": None,
},
"created_date": {
"name": "created_at",
"keep": True,
"implication_property": None,
},
"last_modified_date": {
"name": "updated_at",
"keep": True,
"implication_property": None,
},
"account": {
"name": "account",
"keep": False,
"implication_property": {
"implied_entity_type": "ACCOUNT",
"implied_relationship_name": "is_account_of",
},
},
},
"VENDOR": {},
"EMPLOYEE": {},
}
current_entity_types = conn.execute(
sa.text("SELECT id_name, attributes from kg_entity_type")
).all()
for entity_type, attributes in current_entity_types:
# delete removed entity types
if entity_type not in new_entity_type_conversion:
op.execute(
sa.text(f"DELETE FROM kg_entity_type WHERE id_name = '{entity_type}'")
)
continue
# update entity type attributes
if "metadata_attributes" in attributes:
del attributes["metadata_attributes"]
attributes["metadata_attribute_conversion"] = new_entity_type_conversion[
entity_type
]
attributes_str = json.dumps(attributes).replace("'", "''")
op.execute(
sa.text(
f"UPDATE kg_entity_type SET attributes = '{attributes_str}'"
f"WHERE id_name = '{entity_type}'"
),
)
def downgrade() -> None:
conn = op.get_bind()
current_entity_types = conn.execute(
sa.text("SELECT id_name, attributes from kg_entity_type")
).all()
for entity_type, attributes in current_entity_types:
conversion = {}
if "metadata_attribute_conversion" in attributes:
conversion = attributes.pop("metadata_attribute_conversion")
attributes["metadata_attributes"] = {
attr: prop["name"] for attr, prop in conversion.items() if prop["keep"]
}
attributes_str = json.dumps(attributes).replace("'", "''")
op.execute(
sa.text(
f"UPDATE kg_entity_type SET attributes = '{attributes_str}'"
f"WHERE id_name = '{entity_type}'"
),
)
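
The upgrade rewrites each kg_entity_type's attributes JSON in place: the legacy metadata_attributes key is dropped and replaced with the full metadata_attribute_conversion mapping, and the downgrade recovers the old shape by keeping only the keep=True entries. A small sketch of that round trip on a hypothetical attributes dict:

    # Round-trip illustration on a hypothetical JIRA-style attributes dict.
    conversion = {
        "issuetype": {"name": "subtype", "keep": True, "implication_property": None},
        "reporter_email": {
            "name": "creator",
            "keep": False,
            "implication_property": {
                "implied_entity_type": "from_email",
                "implied_relationship_name": "is_creator_of",
            },
        },
    }

    # upgrade: drop the old key, store the conversion mapping
    attributes = {"metadata_attributes": {"issuetype": "subtype"}}
    attributes.pop("metadata_attributes", None)
    attributes["metadata_attribute_conversion"] = conversion

    # downgrade: rebuild metadata_attributes from the keep=True entries only
    restored = {
        attr: prop["name"]
        for attr, prop in attributes.pop("metadata_attribute_conversion").items()
        if prop["keep"]
    }
    attributes["metadata_attributes"] = restored
    # -> {"metadata_attributes": {"issuetype": "subtype"}}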

View File

@@ -0,0 +1,193 @@
"""Migration 4: User file UUID primary key swap
Revision ID: 7cc3fcc116c1
Revises: 16c37a30adf2
Create Date: 2025-09-22 09:54:38.292952
This migration performs the critical UUID primary key swap on the user_file table.
It updates all foreign key references to use UUIDs instead of integers.
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql as psql
import logging
logger = logging.getLogger("alembic.runtime.migration")
# revision identifiers, used by Alembic.
revision = "7cc3fcc116c1"
down_revision = "16c37a30adf2"
branch_labels = None
depends_on = None
def upgrade() -> None:
"""Swap user_file primary key from integer to UUID."""
bind = op.get_bind()
inspector = sa.inspect(bind)
# Verify we're in the expected state
user_file_columns = [col["name"] for col in inspector.get_columns("user_file")]
if "new_id" not in user_file_columns:
logger.warning(
"user_file.new_id not found - migration may have already been applied"
)
return
logger.info("Starting UUID primary key swap...")
# === Step 1: Update persona__user_file foreign key to UUID ===
logger.info("Updating persona__user_file foreign key...")
# Drop existing foreign key constraints
op.execute(
"ALTER TABLE persona__user_file DROP CONSTRAINT IF EXISTS persona__user_file_user_file_id_uuid_fkey"
)
op.execute(
"ALTER TABLE persona__user_file DROP CONSTRAINT IF EXISTS persona__user_file_user_file_id_fkey"
)
# Create new foreign key to user_file.new_id
op.create_foreign_key(
"persona__user_file_user_file_id_fkey",
"persona__user_file",
"user_file",
local_cols=["user_file_id_uuid"],
remote_cols=["new_id"],
)
# Drop the old integer column and rename UUID column
op.execute("ALTER TABLE persona__user_file DROP COLUMN IF EXISTS user_file_id")
op.alter_column(
"persona__user_file",
"user_file_id_uuid",
new_column_name="user_file_id",
existing_type=psql.UUID(as_uuid=True),
nullable=False,
)
# Recreate composite primary key
op.execute(
"ALTER TABLE persona__user_file DROP CONSTRAINT IF EXISTS persona__user_file_pkey"
)
op.execute(
"ALTER TABLE persona__user_file ADD PRIMARY KEY (persona_id, user_file_id)"
)
logger.info("Updated persona__user_file to use UUID foreign key")
# === Step 2: Perform the primary key swap on user_file ===
logger.info("Swapping user_file primary key to UUID...")
# Drop the primary key constraint
op.execute("ALTER TABLE user_file DROP CONSTRAINT IF EXISTS user_file_pkey")
# Drop the old id column and rename new_id to id
op.execute("ALTER TABLE user_file DROP COLUMN IF EXISTS id")
op.alter_column(
"user_file",
"new_id",
new_column_name="id",
existing_type=psql.UUID(as_uuid=True),
nullable=False,
)
# Set default for new inserts
op.alter_column(
"user_file",
"id",
existing_type=psql.UUID(as_uuid=True),
server_default=sa.text("gen_random_uuid()"),
)
# Create new primary key
op.execute("ALTER TABLE user_file ADD PRIMARY KEY (id)")
logger.info("Swapped user_file primary key to UUID")
# === Step 3: Update foreign key constraints ===
logger.info("Updating foreign key constraints...")
# Recreate persona__user_file foreign key to point to user_file.id
# Drop existing FK first to break dependency on the unique constraint
op.execute(
"ALTER TABLE persona__user_file DROP CONSTRAINT IF EXISTS persona__user_file_user_file_id_fkey"
)
# Drop the unique constraint on (formerly) new_id BEFORE recreating the FK,
# so the FK will bind to the primary key instead of the unique index.
op.execute("ALTER TABLE user_file DROP CONSTRAINT IF EXISTS uq_user_file_new_id")
# Now recreate FK to the primary key column
op.create_foreign_key(
"persona__user_file_user_file_id_fkey",
"persona__user_file",
"user_file",
local_cols=["user_file_id"],
remote_cols=["id"],
)
# Add foreign keys for project__user_file
existing_fks = inspector.get_foreign_keys("project__user_file")
has_user_file_fk = any(
fk.get("referred_table") == "user_file"
and fk.get("constrained_columns") == ["user_file_id"]
for fk in existing_fks
)
if not has_user_file_fk:
op.create_foreign_key(
"fk_project__user_file_user_file_id",
"project__user_file",
"user_file",
["user_file_id"],
["id"],
)
logger.info("Added project__user_file -> user_file foreign key")
has_project_fk = any(
fk.get("referred_table") == "user_project"
and fk.get("constrained_columns") == ["project_id"]
for fk in existing_fks
)
if not has_project_fk:
op.create_foreign_key(
"fk_project__user_file_project_id",
"project__user_file",
"user_project",
["project_id"],
["id"],
)
logger.info("Added project__user_file -> user_project foreign key")
# === Step 4: Mark files for document_id migration ===
logger.info("Marking files for background document_id migration...")
logger.info("Migration 4 (UUID primary key swap) completed successfully")
logger.info(
"NOTE: Background task will update document IDs in Vespa and search_doc"
)
def downgrade() -> None:
"""Revert UUID primary key back to integer (data destructive!)."""
logger.error("CRITICAL: Downgrading UUID primary key swap is data destructive!")
logger.error(
"This will break all UUID-based references created after the migration."
)
logger.error("Only proceed if absolutely necessary and have backups.")
# The downgrade would need to:
# 1. Add back integer columns
# 2. Generate new sequential IDs
# 3. Update all foreign key references
# 4. Swap primary keys back
# This is complex and risky, so we raise an error instead
raise NotImplementedError(
"Downgrade of UUID primary key swap is not supported due to data loss risk. "
"Manual intervention with data backup/restore is required."
)

View File

@@ -0,0 +1,249 @@
"""add_mcp_server_and_connection_config_models
Revision ID: 7ed603b64d5a
Revises: b329d00a9ea6
Create Date: 2025-07-28 17:35:59.900680
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
from onyx.db.enums import MCPAuthenticationType
# revision identifiers, used by Alembic.
revision = "7ed603b64d5a"
down_revision = "b329d00a9ea6"
branch_labels = None
depends_on = None
def upgrade() -> None:
"""Create tables and columns for MCP Server support"""
# 1. MCP Server main table (no FK constraints yet to avoid circular refs)
op.create_table(
"mcp_server",
sa.Column("id", sa.Integer(), primary_key=True),
sa.Column("owner", sa.String(), nullable=False),
sa.Column("name", sa.String(), nullable=False),
sa.Column("description", sa.String(), nullable=True),
sa.Column("server_url", sa.String(), nullable=False),
sa.Column(
"auth_type",
sa.Enum(
MCPAuthenticationType,
name="mcp_authentication_type",
native_enum=False,
),
nullable=False,
),
sa.Column("admin_connection_config_id", sa.Integer(), nullable=True),
sa.Column(
"created_at",
sa.DateTime(timezone=True),
server_default=sa.text("now()"), # type: ignore
nullable=False,
),
sa.Column(
"updated_at",
sa.DateTime(timezone=True),
server_default=sa.text("now()"), # type: ignore
nullable=False,
),
)
# 2. MCP Connection Config table (can reference mcp_server now that it exists)
op.create_table(
"mcp_connection_config",
sa.Column("id", sa.Integer(), primary_key=True),
sa.Column("mcp_server_id", sa.Integer(), nullable=True),
sa.Column("user_email", sa.String(), nullable=False, default=""),
sa.Column("config", sa.LargeBinary(), nullable=False),
sa.Column(
"created_at",
sa.DateTime(timezone=True),
server_default=sa.text("now()"), # type: ignore
nullable=False,
),
sa.Column(
"updated_at",
sa.DateTime(timezone=True),
server_default=sa.text("now()"), # type: ignore
nullable=False,
),
sa.ForeignKeyConstraint(
["mcp_server_id"], ["mcp_server.id"], ondelete="CASCADE"
),
)
# Helpful indexes
op.create_index(
"ix_mcp_connection_config_server_user",
"mcp_connection_config",
["mcp_server_id", "user_email"],
)
op.create_index(
"ix_mcp_connection_config_user_email",
"mcp_connection_config",
["user_email"],
)
# 3. Add the back-references from mcp_server to connection configs
op.create_foreign_key(
"mcp_server_admin_config_fk",
"mcp_server",
"mcp_connection_config",
["admin_connection_config_id"],
["id"],
ondelete="SET NULL",
)
# 4. Association / access-control tables
op.create_table(
"mcp_server__user",
sa.Column("mcp_server_id", sa.Integer(), primary_key=True),
sa.Column("user_id", sa.UUID(), primary_key=True),
sa.ForeignKeyConstraint(
["mcp_server_id"], ["mcp_server.id"], ondelete="CASCADE"
),
sa.ForeignKeyConstraint(["user_id"], ["user.id"], ondelete="CASCADE"),
)
op.create_table(
"mcp_server__user_group",
sa.Column("mcp_server_id", sa.Integer(), primary_key=True),
sa.Column("user_group_id", sa.Integer(), primary_key=True),
sa.ForeignKeyConstraint(
["mcp_server_id"], ["mcp_server.id"], ondelete="CASCADE"
),
sa.ForeignKeyConstraint(["user_group_id"], ["user_group.id"]),
)
# 5. Update existing `tool` table to allow tools to belong to an MCP server
op.add_column(
"tool",
sa.Column("mcp_server_id", sa.Integer(), nullable=True),
)
# Add column for MCP tool input schema
op.add_column(
"tool",
sa.Column("mcp_input_schema", postgresql.JSONB(), nullable=True),
)
op.create_foreign_key(
"tool_mcp_server_fk",
"tool",
"mcp_server",
["mcp_server_id"],
["id"],
ondelete="CASCADE",
)
# 6. Update persona__tool foreign keys to cascade delete
# This ensures that when a tool is deleted (including via MCP server deletion),
# the corresponding persona__tool rows are also deleted
op.drop_constraint(
"persona__tool_tool_id_fkey", "persona__tool", type_="foreignkey"
)
op.drop_constraint(
"persona__tool_persona_id_fkey", "persona__tool", type_="foreignkey"
)
op.create_foreign_key(
"persona__tool_persona_id_fkey",
"persona__tool",
"persona",
["persona_id"],
["id"],
ondelete="CASCADE",
)
op.create_foreign_key(
"persona__tool_tool_id_fkey",
"persona__tool",
"tool",
["tool_id"],
["id"],
ondelete="CASCADE",
)
# 7. Update research_agent_iteration_sub_step foreign key to SET NULL on delete
# This ensures that when a tool is deleted, the sub_step_tool_id is set to NULL
# instead of causing a foreign key constraint violation
op.drop_constraint(
"research_agent_iteration_sub_step_sub_step_tool_id_fkey",
"research_agent_iteration_sub_step",
type_="foreignkey",
)
op.create_foreign_key(
"research_agent_iteration_sub_step_sub_step_tool_id_fkey",
"research_agent_iteration_sub_step",
"tool",
["sub_step_tool_id"],
["id"],
ondelete="SET NULL",
)
def downgrade() -> None:
"""Drop all MCP-related tables / columns"""
# 1. Drop FK & columns from tool
# op.drop_constraint("tool_mcp_server_fk", "tool", type_="foreignkey")
op.execute("DELETE FROM tool WHERE mcp_server_id IS NOT NULL")
op.drop_constraint(
"research_agent_iteration_sub_step_sub_step_tool_id_fkey",
"research_agent_iteration_sub_step",
type_="foreignkey",
)
op.create_foreign_key(
"research_agent_iteration_sub_step_sub_step_tool_id_fkey",
"research_agent_iteration_sub_step",
"tool",
["sub_step_tool_id"],
["id"],
)
# Restore original persona__tool foreign keys (without CASCADE)
op.drop_constraint(
"persona__tool_persona_id_fkey", "persona__tool", type_="foreignkey"
)
op.drop_constraint(
"persona__tool_tool_id_fkey", "persona__tool", type_="foreignkey"
)
op.create_foreign_key(
"persona__tool_persona_id_fkey",
"persona__tool",
"persona",
["persona_id"],
["id"],
)
op.create_foreign_key(
"persona__tool_tool_id_fkey",
"persona__tool",
"tool",
["tool_id"],
["id"],
)
op.drop_column("tool", "mcp_input_schema")
op.drop_column("tool", "mcp_server_id")
# 2. Drop association tables
op.drop_table("mcp_server__user_group")
op.drop_table("mcp_server__user")
# 3. Drop FK from mcp_server to connection configs
op.drop_constraint("mcp_server_admin_config_fk", "mcp_server", type_="foreignkey")
# 4. Drop connection config indexes & table
op.drop_index(
"ix_mcp_connection_config_user_email", table_name="mcp_connection_config"
)
op.drop_index(
"ix_mcp_connection_config_server_user", table_name="mcp_connection_config"
)
op.drop_table("mcp_connection_config")
# 5. Finally drop mcp_server table
op.drop_table("mcp_server")

View File

@@ -0,0 +1,38 @@
"""drop include citations
Revision ID: 8818cf73fa1a
Revises: 7ed603b64d5a
Create Date: 2025-09-02 19:43:50.060680
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "8818cf73fa1a"
down_revision = "7ed603b64d5a"
branch_labels = None
depends_on = None
def upgrade() -> None:
op.drop_column("prompt", "include_citations")
def downgrade() -> None:
op.add_column(
"prompt",
sa.Column(
"include_citations",
sa.BOOLEAN(),
autoincrement=False,
nullable=True,
),
)
# Set include_citations based on prompt name: FALSE for ImageGeneration, TRUE for others
op.execute(
sa.text(
"UPDATE prompt SET include_citations = CASE WHEN name = 'ImageGeneration' THEN FALSE ELSE TRUE END"
)
)

View File

@@ -0,0 +1,341 @@
"""tag-fix
Revision ID: 90e3b9af7da4
Revises: 62c3a055a141
Create Date: 2025-08-01 20:58:14.607624
"""
import json
import logging
import os
from typing import cast
from typing import Generator
from alembic import op
import sqlalchemy as sa
from onyx.document_index.factory import get_default_document_index
from onyx.document_index.vespa_constants import DOCUMENT_ID_ENDPOINT
from onyx.db.search_settings import SearchSettings
from onyx.configs.app_configs import AUTH_TYPE
from onyx.configs.constants import AuthType
from onyx.document_index.vespa.shared_utils.utils import get_vespa_http_client
logger = logging.getLogger("alembic.runtime.migration")
# revision identifiers, used by Alembic.
revision = "90e3b9af7da4"
down_revision = "62c3a055a141"
branch_labels = None
depends_on = None
SKIP_TAG_FIX = os.environ.get("SKIP_TAG_FIX", "true").lower() == "true"
# override for cloud
if AUTH_TYPE == AuthType.CLOUD:
SKIP_TAG_FIX = True
def set_is_list_for_known_tags() -> None:
"""
Sets is_list to true for all tags that are known to be lists.
"""
LIST_METADATA: list[tuple[str, str]] = [
("CLICKUP", "tags"),
("CONFLUENCE", "labels"),
("DISCOURSE", "tags"),
("FRESHDESK", "emails"),
("GITHUB", "assignees"),
("GITHUB", "labels"),
("GURU", "tags"),
("GURU", "folders"),
("HUBSPOT", "associated_contact_ids"),
("HUBSPOT", "associated_company_ids"),
("HUBSPOT", "associated_deal_ids"),
("HUBSPOT", "associated_ticket_ids"),
("JIRA", "labels"),
("MEDIAWIKI", "categories"),
("ZENDESK", "labels"),
("ZENDESK", "content_tags"),
]
bind = op.get_bind()
for source, key in LIST_METADATA:
bind.execute(
sa.text(
f"""
UPDATE tag
SET is_list = true
WHERE tag_key = '{key}'
AND source = '{source}'
"""
)
)
def set_is_list_for_list_tags() -> None:
"""
Sets is_list to true for all tags which have multiple values for a given
document, key, and source triplet. This only works if we remove old tags
from the database.
"""
bind = op.get_bind()
bind.execute(
sa.text(
"""
UPDATE tag
SET is_list = true
FROM (
SELECT DISTINCT tag.tag_key, tag.source
FROM tag
JOIN document__tag ON tag.id = document__tag.tag_id
GROUP BY tag.tag_key, tag.source, document__tag.document_id
HAVING count(*) > 1
) AS list_tags
WHERE tag.tag_key = list_tags.tag_key
AND tag.source = list_tags.source
"""
)
)
def log_list_tags() -> None:
bind = op.get_bind()
result = bind.execute(
sa.text(
"""
SELECT DISTINCT source, tag_key
FROM tag
WHERE is_list
ORDER BY source, tag_key
"""
)
).fetchall()
logger.info(
"List tags:\n" + "\n".join(f" {source}: {key}" for source, key in result)
)
def remove_old_tags() -> None:
"""
Removes old tags from the database.
Previously, there was a bug where if a document got indexed with a tag and then
the document got reindexed, the old tag would not be removed.
This function removes those old tags by comparing it against the tags in vespa.
"""
current_search_settings, future_search_settings = active_search_settings()
document_index = get_default_document_index(
current_search_settings, future_search_settings
)
# Get the index name
if hasattr(document_index, "index_name"):
index_name = document_index.index_name
else:
# Default index name if we can't get it from the document_index
index_name = "danswer_index"
for batch in _get_batch_documents_with_multiple_tags():
n_deleted = 0
for document_id in batch:
true_metadata = _get_vespa_metadata(document_id, index_name)
tags = _get_document_tags(document_id)
# identify document__tags to delete
to_delete: list[str] = []
for tag_id, tag_key, tag_value in tags:
true_val = true_metadata.get(tag_key, "")
if (isinstance(true_val, list) and tag_value not in true_val) or (
isinstance(true_val, str) and tag_value != true_val
):
to_delete.append(str(tag_id))
if not to_delete:
continue
# delete old document__tags
bind = op.get_bind()
result = bind.execute(
sa.text(
f"""
DELETE FROM document__tag
WHERE document_id = '{document_id}'
AND tag_id IN ({','.join(to_delete)})
"""
)
)
n_deleted += result.rowcount
logger.info(f"Processed {len(batch)} documents and deleted {n_deleted} tags")
def active_search_settings() -> tuple[SearchSettings, SearchSettings | None]:
result = op.get_bind().execute(
sa.text(
"""
SELECT * FROM search_settings WHERE status = 'PRESENT' ORDER BY id DESC LIMIT 1
"""
)
)
search_settings_fetch = result.fetchall()
search_settings = (
SearchSettings(**search_settings_fetch[0]._asdict())
if search_settings_fetch
else None
)
result2 = op.get_bind().execute(
sa.text(
"""
SELECT * FROM search_settings WHERE status = 'FUTURE' ORDER BY id DESC LIMIT 1
"""
)
)
search_settings_future_fetch = result2.fetchall()
search_settings_future = (
SearchSettings(**search_settings_future_fetch[0]._asdict())
if search_settings_future_fetch
else None
)
if not isinstance(search_settings, SearchSettings):
raise RuntimeError(
"current search settings is of type " + str(type(search_settings))
)
if (
not isinstance(search_settings_future, SearchSettings)
and search_settings_future is not None
):
raise RuntimeError(
"future search settings is of type " + str(type(search_settings_future))
)
return search_settings, search_settings_future
def _get_batch_documents_with_multiple_tags(
batch_size: int = 128,
) -> Generator[list[str], None, None]:
"""
Returns a list of document ids which contain a one to many tag.
The document may either contain a list metadata value, or may contain leftover
old tags from reindexing.
"""
offset_clause = ""
bind = op.get_bind()
while True:
batch = bind.execute(
sa.text(
f"""
SELECT DISTINCT document__tag.document_id
FROM tag
JOIN document__tag ON tag.id = document__tag.tag_id
GROUP BY tag.tag_key, tag.source, document__tag.document_id
HAVING count(*) > 1 {offset_clause}
ORDER BY document__tag.document_id
LIMIT {batch_size}
"""
)
).fetchall()
if not batch:
break
doc_ids = [document_id for document_id, in batch]
yield doc_ids
offset_clause = f"AND document__tag.document_id > '{doc_ids[-1]}'"
def _get_vespa_metadata(
document_id: str, index_name: str
) -> dict[str, str | list[str]]:
url = DOCUMENT_ID_ENDPOINT.format(index_name=index_name)
# Document-Selector language
selection = (
f"{index_name}.document_id=='{document_id}' and {index_name}.chunk_id==0"
)
params: dict[str, str | int] = {
"selection": selection,
"wantedDocumentCount": 1,
"fieldSet": f"{index_name}:metadata",
}
with get_vespa_http_client() as client:
resp = client.get(url, params=params)
resp.raise_for_status()
docs = resp.json().get("documents", [])
if not docs:
raise RuntimeError(f"No chunk-0 found for document {document_id}")
# for some reason, metadata is a string
metadata = docs[0]["fields"]["metadata"]
return json.loads(metadata)
def _get_document_tags(document_id: str) -> list[tuple[int, str, str]]:
bind = op.get_bind()
result = bind.execute(
sa.text(
f"""
SELECT tag.id, tag.tag_key, tag.tag_value
FROM tag
JOIN document__tag ON tag.id = document__tag.tag_id
WHERE document__tag.document_id = '{document_id}'
"""
)
).fetchall()
return cast(list[tuple[int, str, str]], result)
def upgrade() -> None:
op.add_column(
"tag",
sa.Column("is_list", sa.Boolean(), nullable=False, server_default="false"),
)
op.drop_constraint(
constraint_name="_tag_key_value_source_uc",
table_name="tag",
type_="unique",
)
op.create_unique_constraint(
constraint_name="_tag_key_value_source_list_uc",
table_name="tag",
columns=["tag_key", "tag_value", "source", "is_list"],
)
set_is_list_for_known_tags()
if SKIP_TAG_FIX:
logger.warning(
"Skipping removal of old tags. "
"This can cause issues when using the knowledge graph, or "
"when filtering for documents by tags."
)
log_list_tags()
return
remove_old_tags()
set_is_list_for_list_tags()
# debug
log_list_tags()
def downgrade() -> None:
# the migration adds and populates the is_list column, and removes old bugged tags
# there isn't a point in adding back the bugged tags, so we just drop the column
op.drop_constraint(
constraint_name="_tag_key_value_source_list_uc",
table_name="tag",
type_="unique",
)
op.create_unique_constraint(
constraint_name="_tag_key_value_source_uc",
table_name="tag",
columns=["tag_key", "tag_value", "source"],
)
op.drop_column("tag", "is_list")

View File

@@ -0,0 +1,45 @@
"""mcp_tool_enabled
Revision ID: 96a5702df6aa
Revises: 40926a4dab77
Create Date: 2025-10-09 12:10:21.733097
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "96a5702df6aa"
down_revision = "40926a4dab77"
branch_labels = None
depends_on = None
DELETE_DISABLED_TOOLS_SQL = "DELETE FROM tool WHERE enabled = false"
def upgrade() -> None:
op.add_column(
"tool",
sa.Column(
"enabled",
sa.Boolean(),
nullable=False,
server_default=sa.true(),
),
)
op.create_index(
"ix_tool_mcp_server_enabled",
"tool",
["mcp_server_id", "enabled"],
)
# Remove the server default so the application controls defaulting
op.alter_column("tool", "enabled", server_default=None)
def downgrade() -> None:
op.execute(DELETE_DISABLED_TOOLS_SQL)
op.drop_index("ix_tool_mcp_server_enabled", table_name="tool")
op.drop_column("tool", "enabled")

View File

@@ -103,6 +103,7 @@ def upgrade() -> None:
def downgrade() -> None:
op.drop_column("connector_credential_pair", "is_user_file")
# Drop the persona__user_folder table
op.drop_table("persona__user_folder")
# Drop the persona__user_file table
@@ -111,4 +112,3 @@ def downgrade() -> None:
op.drop_table("user_file")
# Drop the user_folder table
op.drop_table("user_folder")
op.drop_column("connector_credential_pair", "is_user_file")

View File

@@ -0,0 +1,257 @@
"""Migration 1: User file schema additions
Revision ID: 9b66d3156fc6
Revises: b4ef3ae0bf6e
Create Date: 2025-09-22 09:42:06.086732
This migration adds new columns and tables without modifying existing data.
It is safe to run and can be easily rolled back.
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql as psql
import logging
logger = logging.getLogger("alembic.runtime.migration")
# revision identifiers, used by Alembic.
revision = "9b66d3156fc6"
down_revision = "b4ef3ae0bf6e"
branch_labels = None
depends_on = None
def upgrade() -> None:
"""Add new columns and tables without modifying existing data."""
# Enable pgcrypto for UUID generation
op.execute("CREATE EXTENSION IF NOT EXISTS pgcrypto")
bind = op.get_bind()
inspector = sa.inspect(bind)
# === USER_FILE: Add new columns ===
logger.info("Adding new columns to user_file table...")
user_file_columns = [col["name"] for col in inspector.get_columns("user_file")]
# Check if ID is already UUID (in case of re-run after partial migration)
id_is_uuid = any(
col["name"] == "id" and "uuid" in str(col["type"]).lower()
for col in inspector.get_columns("user_file")
)
# Add transitional UUID column only if ID is not already UUID
if "new_id" not in user_file_columns and not id_is_uuid:
op.add_column(
"user_file",
sa.Column(
"new_id",
psql.UUID(as_uuid=True),
nullable=True,
server_default=sa.text("gen_random_uuid()"),
),
)
op.create_unique_constraint("uq_user_file_new_id", "user_file", ["new_id"])
logger.info("Added new_id column to user_file")
# Add status column
if "status" not in user_file_columns:
op.add_column(
"user_file",
sa.Column(
"status",
sa.Enum(
"PROCESSING",
"COMPLETED",
"FAILED",
"CANCELED",
name="userfilestatus",
native_enum=False,
),
nullable=False,
server_default="PROCESSING",
),
)
logger.info("Added status column to user_file")
# Add other tracking columns
if "chunk_count" not in user_file_columns:
op.add_column(
"user_file", sa.Column("chunk_count", sa.Integer(), nullable=True)
)
logger.info("Added chunk_count column to user_file")
if "last_accessed_at" not in user_file_columns:
op.add_column(
"user_file",
sa.Column("last_accessed_at", sa.DateTime(timezone=True), nullable=True),
)
logger.info("Added last_accessed_at column to user_file")
if "needs_project_sync" not in user_file_columns:
op.add_column(
"user_file",
sa.Column(
"needs_project_sync",
sa.Boolean(),
nullable=False,
server_default=sa.text("false"),
),
)
logger.info("Added needs_project_sync column to user_file")
if "last_project_sync_at" not in user_file_columns:
op.add_column(
"user_file",
sa.Column(
"last_project_sync_at", sa.DateTime(timezone=True), nullable=True
),
)
logger.info("Added last_project_sync_at column to user_file")
if "document_id_migrated" not in user_file_columns:
op.add_column(
"user_file",
sa.Column(
"document_id_migrated",
sa.Boolean(),
nullable=False,
server_default=sa.text("true"),
),
)
logger.info("Added document_id_migrated column to user_file")
# === USER_FOLDER -> USER_PROJECT rename ===
table_names = set(inspector.get_table_names())
if "user_folder" in table_names:
logger.info("Updating user_folder table...")
# Make description nullable first
op.alter_column("user_folder", "description", nullable=True)
# Rename table if user_project doesn't exist
if "user_project" not in table_names:
op.execute("ALTER TABLE user_folder RENAME TO user_project")
logger.info("Renamed user_folder to user_project")
elif "user_project" in table_names:
# If already renamed, ensure column nullability
project_cols = [col["name"] for col in inspector.get_columns("user_project")]
if "description" in project_cols:
op.alter_column("user_project", "description", nullable=True)
# Add instructions column to user_project
inspector = sa.inspect(bind) # Refresh after rename
if "user_project" in inspector.get_table_names():
project_columns = [col["name"] for col in inspector.get_columns("user_project")]
if "instructions" not in project_columns:
op.add_column(
"user_project",
sa.Column("instructions", sa.String(), nullable=True),
)
logger.info("Added instructions column to user_project")
# === CHAT_SESSION: Add project_id ===
chat_session_columns = [
col["name"] for col in inspector.get_columns("chat_session")
]
if "project_id" not in chat_session_columns:
op.add_column(
"chat_session",
sa.Column("project_id", sa.Integer(), nullable=True),
)
logger.info("Added project_id column to chat_session")
# === PERSONA__USER_FILE: Add UUID column ===
persona_user_file_columns = [
col["name"] for col in inspector.get_columns("persona__user_file")
]
if "user_file_id_uuid" not in persona_user_file_columns:
op.add_column(
"persona__user_file",
sa.Column("user_file_id_uuid", psql.UUID(as_uuid=True), nullable=True),
)
logger.info("Added user_file_id_uuid column to persona__user_file")
# === PROJECT__USER_FILE: Create new table ===
if "project__user_file" not in inspector.get_table_names():
op.create_table(
"project__user_file",
sa.Column("project_id", sa.Integer(), nullable=False),
sa.Column("user_file_id", psql.UUID(as_uuid=True), nullable=False),
sa.PrimaryKeyConstraint("project_id", "user_file_id"),
)
op.create_index(
"idx_project__user_file_user_file_id",
"project__user_file",
["user_file_id"],
)
logger.info("Created project__user_file table")
logger.info("Migration 1 (schema additions) completed successfully")
def downgrade() -> None:
"""Remove added columns and tables."""
bind = op.get_bind()
inspector = sa.inspect(bind)
logger.info("Starting downgrade of schema additions...")
# Drop project__user_file table
if "project__user_file" in inspector.get_table_names():
op.drop_index("idx_project__user_file_user_file_id", "project__user_file")
op.drop_table("project__user_file")
logger.info("Dropped project__user_file table")
# Remove columns from persona__user_file
if "persona__user_file" in inspector.get_table_names():
columns = [col["name"] for col in inspector.get_columns("persona__user_file")]
if "user_file_id_uuid" in columns:
op.drop_column("persona__user_file", "user_file_id_uuid")
logger.info("Dropped user_file_id_uuid from persona__user_file")
# Remove columns from chat_session
if "chat_session" in inspector.get_table_names():
columns = [col["name"] for col in inspector.get_columns("chat_session")]
if "project_id" in columns:
op.drop_column("chat_session", "project_id")
logger.info("Dropped project_id from chat_session")
# Rename user_project back to user_folder and remove instructions
if "user_project" in inspector.get_table_names():
columns = [col["name"] for col in inspector.get_columns("user_project")]
if "instructions" in columns:
op.drop_column("user_project", "instructions")
op.execute("ALTER TABLE user_project RENAME TO user_folder")
op.alter_column("user_folder", "description", nullable=False)
logger.info("Renamed user_project back to user_folder")
# Remove columns from user_file
if "user_file" in inspector.get_table_names():
columns = [col["name"] for col in inspector.get_columns("user_file")]
columns_to_drop = [
"document_id_migrated",
"last_project_sync_at",
"needs_project_sync",
"last_accessed_at",
"chunk_count",
"status",
]
for col in columns_to_drop:
if col in columns:
op.drop_column("user_file", col)
logger.info(f"Dropped {col} from user_file")
if "new_id" in columns:
op.drop_constraint("uq_user_file_new_id", "user_file", type_="unique")
op.drop_column("user_file", "new_id")
logger.info("Dropped new_id from user_file")
# Drop enum type if no columns use it
bind.execute(sa.text("DROP TYPE IF EXISTS userfilestatus"))
logger.info("Downgrade completed successfully")

View File

@@ -0,0 +1,32 @@
"""Add public_external_user_group table
Revision ID: a7688ab35c45
Revises: 5c448911b12f
Create Date: 2025-05-06 20:55:12.747875
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "a7688ab35c45"
down_revision = "5c448911b12f"
branch_labels = None
depends_on = None
def upgrade() -> None:
op.create_table(
"public_external_user_group",
sa.Column("external_user_group_id", sa.String(), nullable=False),
sa.Column("cc_pair_id", sa.Integer(), nullable=False),
sa.PrimaryKeyConstraint("external_user_group_id", "cc_pair_id"),
sa.ForeignKeyConstraint(
["cc_pair_id"], ["connector_credential_pair.id"], ondelete="CASCADE"
),
)
def downgrade() -> None:
op.drop_table("public_external_user_group")

View File

@@ -0,0 +1,225 @@
"""merge prompt into persona
Revision ID: abbfec3a5ac5
Revises: 8818cf73fa1a
Create Date: 2024-12-19 12:00:00.000000
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
# revision identifiers, used by Alembic.
revision = "abbfec3a5ac5"
down_revision = "8818cf73fa1a"
branch_labels = None
depends_on = None
MAX_PROMPT_LENGTH = 5_000_000
def upgrade() -> None:
"""NOTE: Prompts without any Personas will just be lost."""
# Step 1: Add new columns to persona table (only if they don't exist)
# Check if columns exist before adding them
connection = op.get_bind()
inspector = sa.inspect(connection)
existing_columns = [col["name"] for col in inspector.get_columns("persona")]
if "system_prompt" not in existing_columns:
op.add_column(
"persona",
sa.Column(
"system_prompt", sa.String(length=MAX_PROMPT_LENGTH), nullable=True
),
)
if "task_prompt" not in existing_columns:
op.add_column(
"persona",
sa.Column(
"task_prompt", sa.String(length=MAX_PROMPT_LENGTH), nullable=True
),
)
if "datetime_aware" not in existing_columns:
op.add_column(
"persona",
sa.Column(
"datetime_aware", sa.Boolean(), nullable=False, server_default="true"
),
)
# Step 2: Migrate data from prompt table to persona table (only if tables exist)
existing_tables = inspector.get_table_names()
if "prompt" in existing_tables and "persona__prompt" in existing_tables:
# For personas that have associated prompts, copy the prompt data
op.execute(
"""
UPDATE persona
SET
system_prompt = p.system_prompt,
task_prompt = p.task_prompt,
datetime_aware = p.datetime_aware
FROM (
-- Get the first prompt for each persona (in case there are multiple)
SELECT DISTINCT ON (pp.persona_id)
pp.persona_id,
pr.system_prompt,
pr.task_prompt,
pr.datetime_aware
FROM persona__prompt pp
JOIN prompt pr ON pp.prompt_id = pr.id
) p
WHERE persona.id = p.persona_id
"""
)
# Step 3: Update chat_message references
# Chat messages referenced prompt_id; now that the prompt content lives on the
# persona, the prompt_id foreign key and column can simply be dropped.
# Check if chat_message has prompt_id column
chat_message_columns = [
col["name"] for col in inspector.get_columns("chat_message")
]
if "prompt_id" in chat_message_columns:
op.execute(
"""
ALTER TABLE chat_message
DROP CONSTRAINT IF EXISTS chat_message__prompt_fk
"""
)
op.drop_column("chat_message", "prompt_id")
# Step 4: Handle personas without prompts - set default values if needed (always run this)
op.execute(
"""
UPDATE persona
SET
system_prompt = COALESCE(system_prompt, ''),
task_prompt = COALESCE(task_prompt, '')
WHERE system_prompt IS NULL OR task_prompt IS NULL
"""
)
# Step 5: Drop the persona__prompt association table (if it exists)
if "persona__prompt" in existing_tables:
op.drop_table("persona__prompt")
# Step 6: Drop the prompt table (if it exists)
if "prompt" in existing_tables:
op.drop_table("prompt")
# Step 7: Make system_prompt and task_prompt non-nullable now that they are backfilled
op.alter_column(
"persona",
"system_prompt",
existing_type=sa.String(length=MAX_PROMPT_LENGTH),
nullable=False,
server_default=None,
)
op.alter_column(
"persona",
"task_prompt",
existing_type=sa.String(length=MAX_PROMPT_LENGTH),
nullable=False,
server_default=None,
)
def downgrade() -> None:
# Step 1: Recreate the prompt table
op.create_table(
"prompt",
sa.Column("id", sa.Integer(), nullable=False),
sa.Column("user_id", postgresql.UUID(as_uuid=True), nullable=True),
sa.Column("name", sa.String(), nullable=False),
sa.Column("description", sa.String(), nullable=False),
sa.Column("system_prompt", sa.String(length=MAX_PROMPT_LENGTH), nullable=False),
sa.Column("task_prompt", sa.String(length=MAX_PROMPT_LENGTH), nullable=False),
sa.Column(
"datetime_aware", sa.Boolean(), nullable=False, server_default="true"
),
sa.Column(
"default_prompt", sa.Boolean(), nullable=False, server_default="false"
),
sa.Column("deleted", sa.Boolean(), nullable=False, server_default="false"),
sa.ForeignKeyConstraint(["user_id"], ["user.id"], ondelete="CASCADE"),
sa.PrimaryKeyConstraint("id"),
)
# Step 2: Recreate the persona__prompt association table
op.create_table(
"persona__prompt",
sa.Column("persona_id", sa.Integer(), nullable=False),
sa.Column("prompt_id", sa.Integer(), nullable=False),
sa.ForeignKeyConstraint(
["persona_id"],
["persona.id"],
),
sa.ForeignKeyConstraint(
["prompt_id"],
["prompt.id"],
),
sa.PrimaryKeyConstraint("persona_id", "prompt_id"),
)
# Step 3: Migrate data back from persona to prompt table
op.execute(
"""
INSERT INTO prompt (
name,
description,
system_prompt,
task_prompt,
datetime_aware,
default_prompt,
deleted,
user_id
)
SELECT
CONCAT('Prompt for ', name),
description,
system_prompt,
task_prompt,
datetime_aware,
is_default_persona,
deleted,
user_id
FROM persona
WHERE system_prompt IS NOT NULL AND system_prompt != ''
RETURNING id, name
"""
)
# Step 4: Re-establish persona__prompt relationships
op.execute(
"""
INSERT INTO persona__prompt (persona_id, prompt_id)
SELECT
p.id as persona_id,
pr.id as prompt_id
FROM persona p
JOIN prompt pr ON pr.name = CONCAT('Prompt for ', p.name)
WHERE p.system_prompt IS NOT NULL AND p.system_prompt != ''
"""
)
# Step 5: Add prompt_id column back to chat_message
op.add_column("chat_message", sa.Column("prompt_id", sa.Integer(), nullable=True))
# Step 6: Re-establish foreign key constraint
op.create_foreign_key(
"chat_message__prompt_fk", "chat_message", "prompt", ["prompt_id"], ["id"]
)
# Step 7: Remove columns from persona table
op.drop_column("persona", "datetime_aware")
op.drop_column("persona", "task_prompt")
op.drop_column("persona", "system_prompt")

View File

@@ -0,0 +1,123 @@
"""add_mcp_auth_performer
Revision ID: b30353be4eec
Revises: 2b75d0a8ffcb
Create Date: 2025-09-13 14:58:08.413534
"""
from alembic import op
import sqlalchemy as sa
from onyx.db.enums import MCPAuthenticationPerformer, MCPTransport
# revision identifiers, used by Alembic.
revision = "b30353be4eec"
down_revision = "2b75d0a8ffcb"
branch_labels = None
depends_on = None
def upgrade() -> None:
"""moving to a better way of handling auth performer and transport"""
# Add nullable column first for backward compatibility
op.add_column(
"mcp_server",
sa.Column(
"auth_performer",
sa.Enum(MCPAuthenticationPerformer, native_enum=False),
nullable=True,
),
)
op.add_column(
"mcp_server",
sa.Column(
"transport",
sa.Enum(MCPTransport, native_enum=False),
nullable=True,
),
)
# Backfill values using existing data and inference rules
bind = op.get_bind()
# 1) OAUTH servers are always PER_USER
bind.execute(
sa.text(
"""
UPDATE mcp_server
SET auth_performer = 'PER_USER'
WHERE auth_type = 'OAUTH'
"""
)
)
# 2) If there is no admin connection config, mark as ADMIN (and not set yet)
bind.execute(
sa.text(
"""
UPDATE mcp_server
SET auth_performer = 'ADMIN'
WHERE admin_connection_config_id IS NULL
AND auth_performer IS NULL
"""
)
)
# 3) If there exists any user-specific connection config (user_email != ''), mark as PER_USER
bind.execute(
sa.text(
"""
UPDATE mcp_server AS ms
SET auth_performer = 'PER_USER'
FROM mcp_connection_config AS mcc
WHERE mcc.mcp_server_id = ms.id
AND COALESCE(mcc.user_email, '') <> ''
AND ms.auth_performer IS NULL
"""
)
)
# 4) Default any remaining nulls to ADMIN (covers API_TOKEN admin-managed and NONE)
bind.execute(
sa.text(
"""
UPDATE mcp_server
SET auth_performer = 'ADMIN'
WHERE auth_performer IS NULL
"""
)
)
# Finally, make the column non-nullable
op.alter_column(
"mcp_server",
"auth_performer",
existing_type=sa.Enum(MCPAuthenticationPerformer, native_enum=False),
nullable=False,
)
# Backfill transport for existing rows to STREAMABLE_HTTP, then make non-nullable
bind.execute(
sa.text(
"""
UPDATE mcp_server
SET transport = 'STREAMABLE_HTTP'
WHERE transport IS NULL
"""
)
)
op.alter_column(
"mcp_server",
"transport",
existing_type=sa.Enum(MCPTransport, native_enum=False),
nullable=False,
)
def downgrade() -> None:
"""remove cols"""
op.drop_column("mcp_server", "transport")
op.drop_column("mcp_server", "auth_performer")

View File

@@ -0,0 +1,38 @@
"""Adding assistant-specific user preferences
Revision ID: b329d00a9ea6
Revises: f9b8c7d6e5a4
Create Date: 2025-08-26 23:14:44.592985
"""
from alembic import op
import fastapi_users_db_sqlalchemy
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
# revision identifiers, used by Alembic.
revision = "b329d00a9ea6"
down_revision = "f9b8c7d6e5a4"
branch_labels = None
depends_on = None
def upgrade() -> None:
op.create_table(
"assistant__user_specific_config",
sa.Column("assistant_id", sa.Integer(), nullable=False),
sa.Column(
"user_id",
fastapi_users_db_sqlalchemy.generics.GUID(),
nullable=False,
),
sa.Column("disabled_tool_ids", postgresql.ARRAY(sa.Integer()), nullable=False),
sa.ForeignKeyConstraint(["assistant_id"], ["persona.id"], ondelete="CASCADE"),
sa.ForeignKeyConstraint(["user_id"], ["user.id"], ondelete="CASCADE"),
sa.PrimaryKeyConstraint("assistant_id", "user_id"),
)
def downgrade() -> None:
op.drop_table("assistant__user_specific_config")

View File

@@ -0,0 +1,27 @@
"""add_user_oauth_token_to_slack_bot
Revision ID: b4ef3ae0bf6e
Revises: 505c488f6662
Create Date: 2025-08-26 17:47:41.788462
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "b4ef3ae0bf6e"
down_revision = "505c488f6662"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Add user_token column to slack_bot table
op.add_column("slack_bot", sa.Column("user_token", sa.LargeBinary(), nullable=True))
def downgrade() -> None:
# Remove user_token column from slack_bot table
op.drop_column("slack_bot", "user_token")

View File

@@ -0,0 +1,33 @@
"""Pause finished user file connectors
Revision ID: b558f51620b4
Revises: 90e3b9af7da4
Create Date: 2025-08-15 17:17:02.456704
"""
from alembic import op
# revision identifiers, used by Alembic.
revision = "b558f51620b4"
down_revision = "90e3b9af7da4"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Set all user file connector credential pairs with ACTIVE status to PAUSED
# This ensures user files don't continue to run indexing tasks after processing
op.execute(
"""
UPDATE connector_credential_pair
SET status = 'PAUSED'
WHERE is_user_file = true
AND status = 'ACTIVE'
"""
)
def downgrade() -> None:
pass
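
A hedged smoke test, illustrative only: given an open SQLAlchemy connection, no user-file pair should remain ACTIVE after the UPDATE.

from sqlalchemy import text

remaining = connection.execute(
    text(
        "SELECT COUNT(*) FROM connector_credential_pair "
        "WHERE is_user_file = true AND status = 'ACTIVE'"
    )
).scalar()
assert remaining == 0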

View File

@@ -0,0 +1,43 @@
"""adjust prompt length
Revision ID: b7ec9b5b505f
Revises: abbfec3a5ac5
Create Date: 2025-09-10 18:51:15.629197
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "b7ec9b5b505f"
down_revision = "abbfec3a5ac5"
branch_labels = None
depends_on = None
MAX_PROMPT_LENGTH = 5_000_000
def upgrade() -> None:
# NOTE: needed because the previous migration originally set the length to 8000
op.alter_column(
"persona",
"system_prompt",
existing_type=sa.String(length=8000),
type_=sa.String(length=MAX_PROMPT_LENGTH),
existing_nullable=False,
)
op.alter_column(
"persona",
"task_prompt",
existing_type=sa.String(length=8000),
type_=sa.String(length=MAX_PROMPT_LENGTH),
existing_nullable=False,
)
def downgrade() -> None:
# Downgrade not necessary
pass

View File

@@ -0,0 +1,147 @@
"""migrate_agent_sub_questions_to_research_iterations
Revision ID: bd7c3bf8beba
Revises: f8a9b2c3d4e5
Create Date: 2025-08-18 11:33:27.098287
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "bd7c3bf8beba"
down_revision = "f8a9b2c3d4e5"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Get connection to execute raw SQL
connection = op.get_bind()
# First, insert data into research_agent_iteration table
# This creates one iteration record per primary_question_id using the earliest time_created
connection.execute(
sa.text(
"""
INSERT INTO research_agent_iteration (primary_question_id, created_at, iteration_nr, purpose, reasoning)
SELECT
primary_question_id,
MIN(time_created) as created_at,
1 as iteration_nr,
'Generating and researching subquestions' as purpose,
'(No previous reasoning)' as reasoning
FROM agent__sub_question
JOIN chat_message on agent__sub_question.primary_question_id = chat_message.id
WHERE primary_question_id IS NOT NULL
AND chat_message.is_agentic = true
GROUP BY primary_question_id
ON CONFLICT DO NOTHING;
"""
)
)
# Then, insert data into research_agent_iteration_sub_step table
# This migrates each sub-question as a sub-step
connection.execute(
sa.text(
"""
INSERT INTO research_agent_iteration_sub_step (
primary_question_id,
iteration_nr,
iteration_sub_step_nr,
created_at,
sub_step_instructions,
sub_step_tool_id,
sub_answer,
cited_doc_results
)
SELECT
primary_question_id,
1 as iteration_nr,
level_question_num as iteration_sub_step_nr,
time_created as created_at,
sub_question as sub_step_instructions,
1 as sub_step_tool_id,
sub_answer,
sub_question_doc_results as cited_doc_results
FROM agent__sub_question
JOIN chat_message on agent__sub_question.primary_question_id = chat_message.id
WHERE chat_message.is_agentic = true
AND primary_question_id IS NOT NULL
ON CONFLICT DO NOTHING;
"""
)
)
# Update chat_message records: set legacy agentic type and answer purpose for existing agentic messages
connection.execute(
sa.text(
"""
UPDATE chat_message
SET research_answer_purpose = 'ANSWER'
WHERE is_agentic = true
AND research_type IS NULL and
message_type = 'ASSISTANT';
"""
)
)
connection.execute(
sa.text(
"""
UPDATE chat_message
SET research_type = 'LEGACY_AGENTIC'
WHERE is_agentic = true
AND research_type IS NULL;
"""
)
)
def downgrade() -> None:
# Get connection to execute raw SQL
connection = op.get_bind()
# Note: This downgrade removes all research agent iteration data
# There's no way to perfectly restore the original agent__sub_question data
# if it was deleted after this migration
# Delete all research_agent_iteration_sub_step records that were migrated
connection.execute(
sa.text(
"""
DELETE FROM research_agent_iteration_sub_step
USING chat_message
WHERE research_agent_iteration_sub_step.primary_question_id = chat_message.id
AND chat_message.research_type = 'LEGACY_AGENTIC';
"""
)
)
# Delete all research_agent_iteration records that were migrated
connection.execute(
sa.text(
"""
DELETE FROM research_agent_iteration
USING chat_message
WHERE research_agent_iteration.primary_question_id = chat_message.id
AND chat_message.research_type = 'LEGACY_AGENTIC';
"""
)
)
# Revert chat_message updates: clear research fields for legacy agentic messages
connection.execute(
sa.text(
"""
UPDATE chat_message
SET research_type = NULL,
research_answer_purpose = NULL
WHERE is_agentic = true
AND research_type = 'LEGACY_AGENTIC'
AND message_type = 'ASSISTANT';
"""
)
)
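
A hedged consistency check, not part of the migration: every qualifying primary question should have an iteration row. Because the INSERTs use ON CONFLICT DO NOTHING, pre-existing rows can make the iteration count strictly larger, so the check is an inequality (assumes an open `connection`):

from sqlalchemy import text

migrated = connection.execute(
    text(
        "SELECT COUNT(DISTINCT asq.primary_question_id) "
        "FROM agent__sub_question asq "
        "JOIN chat_message cm ON asq.primary_question_id = cm.id "
        "WHERE asq.primary_question_id IS NOT NULL AND cm.is_agentic = true"
    )
).scalar()
iterations = connection.execute(
    text("SELECT COUNT(*) FROM research_agent_iteration WHERE iteration_nr = 1")
).scalar()
assert iterations >= migrated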

View File

@@ -0,0 +1,72 @@
"""personalization_user_info
Revision ID: c8a93a2af083
Revises: 6f4f86aef280
Create Date: 2025-10-14 15:59:03.577343
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
# revision identifiers, used by Alembic.
revision = "c8a93a2af083"
down_revision = "6f4f86aef280"
branch_labels = None
depends_on = None
def upgrade() -> None:
op.add_column(
"user",
sa.Column("personal_name", sa.String(), nullable=True),
)
op.add_column(
"user",
sa.Column("personal_role", sa.String(), nullable=True),
)
op.add_column(
"user",
sa.Column(
"use_memories",
sa.Boolean(),
nullable=False,
server_default=sa.true(),
),
)
op.create_table(
"memory",
sa.Column("id", sa.Integer(), nullable=False),
sa.Column("user_id", postgresql.UUID(as_uuid=True), nullable=False),
sa.Column("memory_text", sa.Text(), nullable=False),
sa.Column("conversation_id", postgresql.UUID(as_uuid=True), nullable=True),
sa.Column("message_id", sa.Integer(), nullable=True),
sa.Column(
"created_at",
sa.DateTime(timezone=True),
server_default=sa.func.now(),
nullable=False,
),
sa.Column(
"updated_at",
sa.DateTime(timezone=True),
server_default=sa.func.now(),
nullable=False,
),
sa.ForeignKeyConstraint(["user_id"], ["user.id"], ondelete="CASCADE"),
sa.PrimaryKeyConstraint("id"),
)
op.create_index("ix_memory_user_id", "memory", ["user_id"])
def downgrade() -> None:
op.drop_index("ix_memory_user_id", table_name="memory")
op.drop_table("memory")
op.drop_column("user", "use_memories")
op.drop_column("user", "personal_role")
op.drop_column("user", "personal_name")

View File

@@ -0,0 +1,315 @@
"""modify_file_store_for_external_storage
Revision ID: c9e2cd766c29
Revises: 03bf8be6b53a
Create Date: 2025-06-13 14:02:09.867679
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.orm import Session
from sqlalchemy import text
from typing import cast, Any
from botocore.exceptions import ClientError
from onyx.db._deprecated.pg_file_store import delete_lobj_by_id, read_lobj
from onyx.file_store.file_store import get_s3_file_store
from shared_configs.contextvars import CURRENT_TENANT_ID_CONTEXTVAR
# revision identifiers, used by Alembic.
revision = "c9e2cd766c29"
down_revision = "03bf8be6b53a"
branch_labels = None
depends_on = None
def upgrade() -> None:
try:
# Modify existing file_store table to support external storage
op.rename_table("file_store", "file_record")
# Make lobj_oid nullable (for external storage files)
op.alter_column("file_record", "lobj_oid", nullable=True)
# Add external storage columns with generic names
op.add_column(
"file_record", sa.Column("bucket_name", sa.String(), nullable=True)
)
op.add_column(
"file_record", sa.Column("object_key", sa.String(), nullable=True)
)
# Add timestamps for tracking
op.add_column(
"file_record",
sa.Column(
"created_at",
sa.DateTime(timezone=True),
server_default=sa.func.now(),
nullable=False,
),
)
op.add_column(
"file_record",
sa.Column(
"updated_at",
sa.DateTime(timezone=True),
server_default=sa.func.now(),
nullable=False,
),
)
op.alter_column("file_record", "file_name", new_column_name="file_id")
except Exception as e:
if "does not exist" in str(e) or 'relation "file_store" does not exist' in str(
e
):
print(
f"Ran into error - {e}. Likely means we had a partial success in the past, continuing..."
)
else:
raise
print(
"External storage configured - migrating files from PostgreSQL to external storage..."
)
# if we fail midway through this, we'll have a partial success. Running the migration
# again should allow us to continue.
_migrate_files_to_external_storage()
print("File migration completed successfully!")
# Remove lobj_oid column
op.drop_column("file_record", "lobj_oid")
def downgrade() -> None:
"""Revert schema changes and migrate files from external storage back to PostgreSQL large objects."""
print(
"Reverting to PostgreSQL-backed file store migrating files from external storage …"
)
# 1. Ensure `lobj_oid` exists on the current `file_record` table (nullable for now).
op.add_column("file_record", sa.Column("lobj_oid", sa.Integer(), nullable=True))
# 2. Move content from external storage back into PostgreSQL large objects (table is still
# called `file_record` so application code continues to work during the copy).
try:
_migrate_files_to_postgres()
except Exception:
print("Error during downgrade migration, rolling back …")
op.drop_column("file_record", "lobj_oid")
raise
# 3. After migration every row should have `lobj_oid` populated; mark it NOT NULL.
op.alter_column("file_record", "lobj_oid", nullable=False)
# 4. Remove columns that are only relevant to external storage.
op.drop_column("file_record", "updated_at")
op.drop_column("file_record", "created_at")
op.drop_column("file_record", "object_key")
op.drop_column("file_record", "bucket_name")
# 5. Rename `file_id` back to `file_name` (still on `file_record`).
op.alter_column("file_record", "file_id", new_column_name="file_name")
# 6. Finally, rename the table back to its original name expected by the legacy codebase.
op.rename_table("file_record", "file_store")
print(
"Downgrade migration completed files are now stored inside PostgreSQL again."
)
# -----------------------------------------------------------------------------
# Helper: migrate from external storage (S3/MinIO) back into PostgreSQL large objects
def _migrate_files_to_postgres() -> None:
"""Move any files whose content lives in external S3-compatible storage back into PostgreSQL.
The logic is the inverse of `_migrate_files_to_external_storage` from the upgrade path.
"""
# Obtain DB session from Alembic context
bind = op.get_bind()
session = Session(bind=bind)
# Fetch rows that have external storage pointers (bucket/object_key not NULL)
result = session.execute(
text(
"SELECT file_id, bucket_name, object_key FROM file_record "
"WHERE bucket_name IS NOT NULL AND object_key IS NOT NULL"
)
)
files_to_migrate = [row[0] for row in result.fetchall()]
total_files = len(files_to_migrate)
if total_files == 0:
print("No files found in external storage to migrate back to PostgreSQL.")
return
print(f"Found {total_files} files to migrate back to PostgreSQL large objects.")
_set_tenant_contextvar(session)
migrated_count = 0
# Only create the external store when there are files to migrate; this way
# S3/MinIO has to be configured only when the migration actually has work to do.
external_store = get_s3_file_store()
for i, file_id in enumerate(files_to_migrate, 1):
print(f"Migrating file {i}/{total_files}: {file_id}")
# Read file content from external storage (always binary)
try:
file_io = external_store.read_file(
file_id=file_id, mode="b", use_tempfile=True
)
file_io.seek(0)
# Import lazily to avoid circular deps at Alembic runtime
from onyx.db._deprecated.pg_file_store import (
create_populate_lobj,
) # noqa: E402
# Create new Postgres large object and populate it
lobj_oid = create_populate_lobj(content=file_io, db_session=session)
# Update DB row: set lobj_oid, clear bucket/object_key
session.execute(
text(
"UPDATE file_record SET lobj_oid = :lobj_oid, bucket_name = NULL, "
"object_key = NULL WHERE file_id = :file_id"
),
{"lobj_oid": lobj_oid, "file_id": file_id},
)
except ClientError as e:
if "NoSuchKey" in str(e):
print(
f"File {file_id} not found in external storage. Deleting from database."
)
session.execute(
text("DELETE FROM file_record WHERE file_id = :file_id"),
{"file_id": file_id},
)
else:
raise
migrated_count += 1
print(f"✓ Successfully migrated file {i}/{total_files}: {file_id}")
# Flush the SQLAlchemy session so statements are sent to the DB, but **do not**
# commit the transaction. The surrounding Alembic migration will commit once
# the *entire* downgrade succeeds. This keeps the whole downgrade atomic and
# avoids leaving the database in a partially-migrated state if a later schema
# operation fails.
session.flush()
print(
f"Migration back to PostgreSQL completed: {migrated_count} files staged for commit."
)
def _migrate_files_to_external_storage() -> None:
"""Migrate files from PostgreSQL large objects to external storage"""
# Get database session
bind = op.get_bind()
session = Session(bind=bind)
external_store = get_s3_file_store()
# Find all files currently stored in PostgreSQL (lobj_oid is not null)
result = session.execute(
text(
"SELECT file_id FROM file_record WHERE lobj_oid IS NOT NULL "
"AND bucket_name IS NULL AND object_key IS NULL"
)
)
files_to_migrate = [row[0] for row in result.fetchall()]
total_files = len(files_to_migrate)
if total_files == 0:
print("No files found in PostgreSQL storage to migrate.")
return
# might need to move this above the if statement when creating a new multi-tenant
# system. VERY extreme edge case.
external_store.initialize()
print(f"Found {total_files} files to migrate from PostgreSQL to external storage.")
_set_tenant_contextvar(session)
migrated_count = 0
for i, file_id in enumerate(files_to_migrate, 1):
print(f"Migrating file {i}/{total_files}: {file_id}")
# Read file record to get metadata
file_record = session.execute(
text("SELECT * FROM file_record WHERE file_id = :file_id"),
{"file_id": file_id},
).fetchone()
if file_record is None:
print(f"File {file_id} not found in PostgreSQL storage.")
continue
lobj_id = cast(int, file_record.lobj_oid) # type: ignore
file_metadata = cast(Any, file_record.file_metadata) # type: ignore
# Read file content from PostgreSQL
try:
file_content = read_lobj(
lobj_id, db_session=session, mode="b", use_tempfile=True
)
except Exception as e:
if "large object" in str(e) and "does not exist" in str(e):
print(f"File {file_id} not found in PostgreSQL storage.")
continue
else:
raise
# Handle file_metadata type conversion: keep dicts as-is, coerce other
# values to a dict when possible, and fall back to None if that fails
if file_metadata is not None and not isinstance(file_metadata, dict):
    try:
        file_metadata = dict(file_metadata)
    except (TypeError, ValueError):
        file_metadata = None
# Save to external storage (this will handle the database record update and cleanup)
# NOTE: this WILL .commit() the transaction.
external_store.save_file(
file_id=file_id,
content=file_content,
display_name=file_record.display_name,
file_origin=file_record.file_origin,
file_type=file_record.file_type,
file_metadata=file_metadata,
)
delete_lobj_by_id(lobj_id, db_session=session)
migrated_count += 1
print(f"✓ Successfully migrated file {i}/{total_files}: {file_id}")
# See note above flush but do **not** commit so the outer Alembic transaction
# controls atomicity.
session.flush()
print(
f"Migration completed: {migrated_count} files staged for commit to external storage."
)
def _set_tenant_contextvar(session: Session) -> None:
"""Set the tenant contextvar to the default schema"""
current_tenant = session.execute(text("SELECT current_schema()")).scalar()
print(f"Migrating files for tenant: {current_tenant}")
CURRENT_TENANT_ID_CONTEXTVAR.set(current_tenant)
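
The migration drives the same store interface the application uses. A hedged round-trip sketch assembled only from the calls visible above; all argument values are hypothetical, and passing None for the origin/metadata fields is an assumption made for brevity:

from io import BytesIO

from onyx.file_store.file_store import get_s3_file_store

store = get_s3_file_store()
store.initialize()  # ensure the bucket exists, as the upgrade path does
store.save_file(
    file_id="example-file-id",
    content=BytesIO(b"hello"),
    display_name="example.txt",
    file_origin=None,  # real callers pass the application's file-origin enum
    file_type="text/plain",
    file_metadata=None,
)
content = store.read_file(file_id="example-file-id", mode="b", use_tempfile=True).read()
assert content == b"hello"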

View File

@@ -0,0 +1,128 @@
"""add_cascade_deletes_to_agent_tables
Revision ID: ca04500b9ee8
Revises: 238b84885828
Create Date: 2025-05-30 16:03:51.112263
"""
from alembic import op
# revision identifiers, used by Alembic.
revision = "ca04500b9ee8"
down_revision = "238b84885828"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Drop existing foreign key constraints
op.drop_constraint(
"agent__sub_question_primary_question_id_fkey",
"agent__sub_question",
type_="foreignkey",
)
op.drop_constraint(
"agent__sub_query_parent_question_id_fkey",
"agent__sub_query",
type_="foreignkey",
)
op.drop_constraint(
"chat_message__standard_answer_chat_message_id_fkey",
"chat_message__standard_answer",
type_="foreignkey",
)
op.drop_constraint(
"agent__sub_query__search_doc_sub_query_id_fkey",
"agent__sub_query__search_doc",
type_="foreignkey",
)
# Recreate foreign key constraints with CASCADE delete
op.create_foreign_key(
"agent__sub_question_primary_question_id_fkey",
"agent__sub_question",
"chat_message",
["primary_question_id"],
["id"],
ondelete="CASCADE",
)
op.create_foreign_key(
"agent__sub_query_parent_question_id_fkey",
"agent__sub_query",
"agent__sub_question",
["parent_question_id"],
["id"],
ondelete="CASCADE",
)
op.create_foreign_key(
"chat_message__standard_answer_chat_message_id_fkey",
"chat_message__standard_answer",
"chat_message",
["chat_message_id"],
["id"],
ondelete="CASCADE",
)
op.create_foreign_key(
"agent__sub_query__search_doc_sub_query_id_fkey",
"agent__sub_query__search_doc",
"agent__sub_query",
["sub_query_id"],
["id"],
ondelete="CASCADE",
)
def downgrade() -> None:
# Drop CASCADE foreign key constraints
op.drop_constraint(
"agent__sub_question_primary_question_id_fkey",
"agent__sub_question",
type_="foreignkey",
)
op.drop_constraint(
"agent__sub_query_parent_question_id_fkey",
"agent__sub_query",
type_="foreignkey",
)
op.drop_constraint(
"chat_message__standard_answer_chat_message_id_fkey",
"chat_message__standard_answer",
type_="foreignkey",
)
op.drop_constraint(
"agent__sub_query__search_doc_sub_query_id_fkey",
"agent__sub_query__search_doc",
type_="foreignkey",
)
# Recreate foreign key constraints without CASCADE delete
op.create_foreign_key(
"agent__sub_question_primary_question_id_fkey",
"agent__sub_question",
"chat_message",
["primary_question_id"],
["id"],
)
op.create_foreign_key(
"agent__sub_query_parent_question_id_fkey",
"agent__sub_query",
"agent__sub_question",
["parent_question_id"],
["id"],
)
op.create_foreign_key(
"chat_message__standard_answer_chat_message_id_fkey",
"chat_message__standard_answer",
"chat_message",
["chat_message_id"],
["id"],
)
op.create_foreign_key(
"agent__sub_query__search_doc_sub_query_id_fkey",
"agent__sub_query__search_doc",
"agent__sub_query",
["sub_query_id"],
["id"],
)
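
The upgrade and downgrade bodies repeat one drop-and-recreate pattern four times. A hedged refactoring sketch of the same behavior (equivalent in effect, not the shipped code; it assumes it runs inside a migration context where alembic.op is bound):

from typing import Optional

from alembic import op

def _recreate_fk(
    name: str,
    source: str,
    referent: str,
    local_cols: list[str],
    remote_cols: list[str],
    ondelete: Optional[str] = None,
) -> None:
    # Drop the existing constraint, then recreate it with the requested ON DELETE rule
    op.drop_constraint(name, source, type_="foreignkey")
    op.create_foreign_key(name, source, referent, local_cols, remote_cols, ondelete=ondelete)

# upgrade() would then reduce to four calls such as:
# _recreate_fk(
#     "agent__sub_question_primary_question_id_fkey",
#     "agent__sub_question", "chat_message",
#     ["primary_question_id"], ["id"], ondelete="CASCADE",
# )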

View File

@@ -0,0 +1,29 @@
"""kgentity_parent
Revision ID: cec7ec36c505
Revises: 495cb26ce93e
Create Date: 2025-06-07 20:07:46.400770
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "cec7ec36c505"
down_revision = "495cb26ce93e"
branch_labels = None
depends_on = None
def upgrade() -> None:
op.add_column(
"kg_entity",
sa.Column("parent_key", sa.String(), nullable=True, index=True),
)
# NOTE: you will have to reindex the KG after this migration as the parent_key will be null
def downgrade() -> None:
op.drop_column("kg_entity", "parent_key")

View File

@@ -0,0 +1,152 @@
"""seed_builtin_tools
Revision ID: d09fc20a3c66
Revises: b7ec9b5b505f
Create Date: 2025-09-09 19:32:16.824373
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "d09fc20a3c66"
down_revision = "b7ec9b5b505f"
branch_labels = None
depends_on = None
# Tool definitions - core tools that should always be seeded
# Names/in_code_tool_id are the same as the class names in the tool_implementations package
BUILT_IN_TOOLS = [
{
"name": "SearchTool",
"display_name": "Internal Search",
"description": "The Search Action allows the Assistant to search through connected knowledge to help build an answer.",
"in_code_tool_id": "SearchTool",
},
{
"name": "ImageGenerationTool",
"display_name": "Image Generation",
"description": (
"The Image Generation Action allows the assistant to use DALL-E 3 or GPT-IMAGE-1 to generate images. "
"The action will be used when the user asks the assistant to generate an image."
),
"in_code_tool_id": "ImageGenerationTool",
},
{
"name": "WebSearchTool",
"display_name": "Web Search",
"description": (
"The Web Search Action allows the assistant "
"to perform internet searches for up-to-date information."
),
"in_code_tool_id": "WebSearchTool",
},
{
"name": "KnowledgeGraphTool",
"display_name": "Knowledge Graph Search",
"description": (
"The Knowledge Graph Search Action allows the assistant to search the "
"Knowledge Graph for information. This tool can (for now) only be active in the KG Beta Assistant, "
"and it requires the Knowledge Graph to be enabled."
),
"in_code_tool_id": "KnowledgeGraphTool",
},
{
"name": "OktaProfileTool",
"display_name": "Okta Profile",
"description": (
"The Okta Profile Action allows the assistant to fetch the current user's information from Okta. "
"This may include the user's name, email, phone number, address, and other details such as their "
"manager and direct reports."
),
"in_code_tool_id": "OktaProfileTool",
},
]
def upgrade() -> None:
conn = op.get_bind()
# Start transaction
conn.execute(sa.text("BEGIN"))
try:
# Get existing tools to check what already exists
existing_tools = conn.execute(
sa.text(
"SELECT in_code_tool_id FROM tool WHERE in_code_tool_id IS NOT NULL"
)
).fetchall()
existing_tool_ids = {row[0] for row in existing_tools}
# Insert or update built-in tools
for tool in BUILT_IN_TOOLS:
in_code_id = tool["in_code_tool_id"]
# Handle historical rename: InternetSearchTool -> WebSearchTool
if (
in_code_id == "WebSearchTool"
and "WebSearchTool" not in existing_tool_ids
and "InternetSearchTool" in existing_tool_ids
):
# Rename the existing InternetSearchTool row in place and update fields
conn.execute(
sa.text(
"""
UPDATE tool
SET name = :name,
display_name = :display_name,
description = :description,
in_code_tool_id = :in_code_tool_id
WHERE in_code_tool_id = 'InternetSearchTool'
"""
),
tool,
)
# Keep the local view of existing ids in sync to avoid duplicate insert
existing_tool_ids.discard("InternetSearchTool")
existing_tool_ids.add("WebSearchTool")
continue
if in_code_id in existing_tool_ids:
# Update existing tool
conn.execute(
sa.text(
"""
UPDATE tool
SET name = :name,
display_name = :display_name,
description = :description
WHERE in_code_tool_id = :in_code_tool_id
"""
),
tool,
)
else:
# Insert new tool
conn.execute(
sa.text(
"""
INSERT INTO tool (name, display_name, description, in_code_tool_id)
VALUES (:name, :display_name, :description, :in_code_tool_id)
"""
),
tool,
)
# Commit transaction
conn.execute(sa.text("COMMIT"))
except Exception as e:
# Rollback on error
conn.execute(sa.text("ROLLBACK"))
raise e
def downgrade() -> None:
# We don't remove the tools on downgrade since it's totally fine to just
# have them around. If we upgrade again, it will be a no-op.
pass
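
The branching in upgrade() encodes a small decision rule: rename beats update beats insert. A hedged pure-Python restatement (hypothetical helper, handy for unit tests):

def plan_action(in_code_id: str, existing: set[str]) -> str:
    # Mirrors the upgrade() branches: a WebSearchTool that only exists under its
    # old InternetSearchTool id is renamed in place; otherwise update or insert.
    if (
        in_code_id == "WebSearchTool"
        and "WebSearchTool" not in existing
        and "InternetSearchTool" in existing
    ):
        return "rename"
    return "update" if in_code_id in existing else "insert"

assert plan_action("WebSearchTool", {"InternetSearchTool"}) == "rename"
assert plan_action("SearchTool", {"SearchTool"}) == "update"
assert plan_action("OktaProfileTool", set()) == "insert"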

View File

@@ -0,0 +1,57 @@
"""Update status length
Revision ID: d961aca62eb3
Revises: cf90764725d8
Create Date: 2025-03-23 16:10:05.683965
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "d961aca62eb3"
down_revision = "cf90764725d8"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Drop the existing enum type constraint
op.execute("ALTER TABLE connector_credential_pair ALTER COLUMN status TYPE varchar")
# Create new enum type with all values
op.execute(
"ALTER TABLE connector_credential_pair ALTER COLUMN status TYPE VARCHAR(20) USING status::varchar(20)"
)
# Update the enum type to include all possible values
op.alter_column(
"connector_credential_pair",
"status",
type_=sa.Enum(
"SCHEDULED",
"INITIAL_INDEXING",
"ACTIVE",
"PAUSED",
"DELETING",
"INVALID",
name="connectorcredentialpairstatus",
native_enum=False,
),
existing_type=sa.String(20),
nullable=False,
)
op.add_column(
"connector_credential_pair",
sa.Column(
"in_repeated_error_state", sa.Boolean, default=False, server_default="false"
),
)
def downgrade() -> None:
# no need to convert back to the old enum type, since we're not using it anymore
op.drop_column("connector_credential_pair", "in_repeated_error_state")

View File

@@ -11,7 +11,7 @@ import sqlalchemy as sa
import json
from onyx.configs.constants import DocumentSource
-from onyx.connectors.onyx_jira.utils import extract_jira_project
from onyx.connectors.jira.utils import extract_jira_project
# revision identifiers, used by Alembic.
@@ -21,6 +21,9 @@ branch_labels = None
depends_on = None
PRESERVED_CONFIG_KEYS = ["comment_email_blacklist", "batch_size", "labels_to_skip"]
def upgrade() -> None:
# Get all Jira connectors
conn = op.get_bind()
@@ -62,6 +65,9 @@ def upgrade() -> None:
f"WARNING: Jira connector {connector_id} has no project URL configured"
)
continue
for old_key in PRESERVED_CONFIG_KEYS:
if old_key in old_config:
new_config[old_key] = old_config[old_key]
# Update the connector config
conn.execute(
@@ -108,6 +114,10 @@ def downgrade() -> None:
else:
continue
for old_key in PRESERVED_CONFIG_KEYS:
if old_key in new_config:
old_config[old_key] = new_config[old_key]
# Update the connector config
conn.execute(
sa.text(
@@ -117,5 +127,5 @@ def downgrade() -> None:
WHERE id = :id
"""
),
{"id": connector_id, "old_config": old_config},
{"id": connector_id, "old_config": json.dumps(old_config)},
)
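
The loops this diff adds in both upgrade() and downgrade() are the same key-preservation step applied in opposite directions. A hedged standalone restatement (hypothetical function name):

PRESERVED_CONFIG_KEYS = ["comment_email_blacklist", "batch_size", "labels_to_skip"]

def carry_over_keys(src: dict, dst: dict) -> dict:
    # Copy only the preserved keys that are actually present in the source config
    for key in PRESERVED_CONFIG_KEYS:
        if key in src:
            dst[key] = src[key]
    return dst

assert carry_over_keys({"batch_size": 50, "project_url": "https://example"}, {}) == {"batch_size": 50}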

View File

@@ -10,12 +10,19 @@ from alembic import op
import sqlalchemy as sa
from sqlalchemy import table, column, String, Integer, Boolean
-from onyx.db.search_settings import (
-    get_new_default_embedding_model,
-    get_old_default_embedding_model,
-    user_has_overridden_embedding_model,
-)
from onyx.configs.model_configs import ASYM_PASSAGE_PREFIX
from onyx.configs.model_configs import ASYM_QUERY_PREFIX
from onyx.configs.model_configs import DOC_EMBEDDING_DIM
from onyx.configs.model_configs import DOCUMENT_ENCODER_MODEL
from onyx.configs.model_configs import NORMALIZE_EMBEDDINGS
from onyx.configs.model_configs import OLD_DEFAULT_DOCUMENT_ENCODER_MODEL
from onyx.configs.model_configs import OLD_DEFAULT_MODEL_DOC_EMBEDDING_DIM
from onyx.configs.model_configs import OLD_DEFAULT_MODEL_NORMALIZE_EMBEDDINGS
from onyx.db.enums import EmbeddingPrecision
from onyx.db.models import IndexModelStatus
from onyx.db.search_settings import user_has_overridden_embedding_model
from onyx.indexing.models import IndexingSetting
from onyx.natural_language_processing.search_nlp_models import clean_model_name
# revision identifiers, used by Alembic.
revision = "dbaa756c2ccf"
@@ -24,6 +31,47 @@ branch_labels: None = None
depends_on: None = None
def _get_old_default_embedding_model() -> IndexingSetting:
is_overridden = user_has_overridden_embedding_model()
return IndexingSetting(
model_name=(
DOCUMENT_ENCODER_MODEL
if is_overridden
else OLD_DEFAULT_DOCUMENT_ENCODER_MODEL
),
model_dim=(
DOC_EMBEDDING_DIM if is_overridden else OLD_DEFAULT_MODEL_DOC_EMBEDDING_DIM
),
embedding_precision=(EmbeddingPrecision.FLOAT),
normalize=(
NORMALIZE_EMBEDDINGS
if is_overridden
else OLD_DEFAULT_MODEL_NORMALIZE_EMBEDDINGS
),
query_prefix=(ASYM_QUERY_PREFIX if is_overridden else ""),
passage_prefix=(ASYM_PASSAGE_PREFIX if is_overridden else ""),
index_name="danswer_chunk",
multipass_indexing=False,
enable_contextual_rag=False,
api_url=None,
)
def _get_new_default_embedding_model() -> IndexingSetting:
return IndexingSetting(
model_name=DOCUMENT_ENCODER_MODEL,
model_dim=DOC_EMBEDDING_DIM,
embedding_precision=(EmbeddingPrecision.BFLOAT16),
normalize=NORMALIZE_EMBEDDINGS,
query_prefix=ASYM_QUERY_PREFIX,
passage_prefix=ASYM_PASSAGE_PREFIX,
index_name=f"danswer_chunk_{clean_model_name(DOCUMENT_ENCODER_MODEL)}",
multipass_indexing=False,
enable_contextual_rag=False,
api_url=None,
)
def upgrade() -> None:
op.create_table(
"embedding_model",
@@ -61,7 +109,7 @@ def upgrade() -> None:
# the user selected via env variables before this change. This is needed since
# all index_attempts must be associated with an embedding model, so without this
# we will run into violations of non-null contraints
-old_embedding_model = get_old_default_embedding_model()
old_embedding_model = _get_old_default_embedding_model()
op.bulk_insert(
EmbeddingModel,
[
@@ -79,7 +127,7 @@ def upgrade() -> None:
# if the user has not overridden the default embedding model via env variables,
# insert the new default model into the database to auto-upgrade them
if not user_has_overridden_embedding_model():
-new_embedding_model = get_new_default_embedding_model()
new_embedding_model = _get_new_default_embedding_model()
op.bulk_insert(
EmbeddingModel,
[
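
The point of this diff is to freeze the embedding-model defaults inside the migration itself: instead of importing get_old_default_embedding_model and get_new_default_embedding_model from onyx.db.search_settings, the migration now carries private _get_* copies. Migrations are historical artifacts, so pinning the values they depend on keeps later refactors of the library helpers from silently changing what this revision does on a fresh database.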

View File

@@ -18,11 +18,13 @@ depends_on: None = None
def upgrade() -> None:
op.execute("DROP TABLE IF EXISTS document CASCADE")
op.create_table(
"document",
sa.Column("id", sa.String(), nullable=False),
sa.PrimaryKeyConstraint("id"),
)
op.execute("DROP TABLE IF EXISTS chunk CASCADE")
op.create_table(
"chunk",
sa.Column("id", sa.String(), nullable=False),
@@ -43,6 +45,7 @@ def upgrade() -> None:
),
sa.PrimaryKeyConstraint("id", "document_store_type"),
)
op.execute("DROP TABLE IF EXISTS deletion_attempt CASCADE")
op.create_table(
"deletion_attempt",
sa.Column("id", sa.Integer(), nullable=False),
@@ -84,6 +87,7 @@ def upgrade() -> None:
),
sa.PrimaryKeyConstraint("id"),
)
op.execute("DROP TABLE IF EXISTS document_by_connector_credential_pair CASCADE")
op.create_table(
"document_by_connector_credential_pair",
sa.Column("id", sa.String(), nullable=False),
@@ -106,7 +110,10 @@ def upgrade() -> None:
def downgrade() -> None:
# upstream tables first
op.drop_table("document_by_connector_credential_pair")
op.drop_table("deletion_attempt")
op.drop_table("chunk")
op.drop_table("document")
# Alembic op.drop_table() has no "cascade" flag issue raw SQL
op.execute("DROP TABLE IF EXISTS document CASCADE")

View File

@@ -0,0 +1,30 @@
"""add research_answer_purpose to chat_message
Revision ID: f8a9b2c3d4e5
Revises: 5ae8240accb3
Create Date: 2025-01-27 12:00:00.000000
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "f8a9b2c3d4e5"
down_revision = "5ae8240accb3"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Add research_answer_purpose column to chat_message table
op.add_column(
"chat_message",
sa.Column("research_answer_purpose", sa.String(), nullable=True),
)
def downgrade() -> None:
# Remove research_answer_purpose column from chat_message table
op.drop_column("chat_message", "research_answer_purpose")

View File

@@ -0,0 +1,69 @@
"""remove foreign key constraints from research_agent_iteration_sub_step
Revision ID: f9b8c7d6e5a4
Revises: bd7c3bf8beba
Create Date: 2025-01-27 12:00:00.000000
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "f9b8c7d6e5a4"
down_revision = "bd7c3bf8beba"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Drop the existing foreign key constraint for parent_question_id
op.drop_constraint(
"research_agent_iteration_sub_step_parent_question_id_fkey",
"research_agent_iteration_sub_step",
type_="foreignkey",
)
# Drop the parent_question_id column entirely
op.drop_column("research_agent_iteration_sub_step", "parent_question_id")
# Drop the foreign key constraint for primary_question_id to chat_message.id
# (keep the column as it's needed for the composite foreign key)
op.drop_constraint(
"research_agent_iteration_sub_step_primary_question_id_fkey",
"research_agent_iteration_sub_step",
type_="foreignkey",
)
def downgrade() -> None:
# Restore the foreign key constraint for primary_question_id to chat_message.id
op.create_foreign_key(
"research_agent_iteration_sub_step_primary_question_id_fkey",
"research_agent_iteration_sub_step",
"chat_message",
["primary_question_id"],
["id"],
ondelete="CASCADE",
)
# Add back the parent_question_id column
op.add_column(
"research_agent_iteration_sub_step",
sa.Column(
"parent_question_id",
sa.Integer(),
nullable=True,
),
)
# Restore the foreign key constraint pointing to research_agent_iteration_sub_step.id
op.create_foreign_key(
"research_agent_iteration_sub_step_parent_question_id_fkey",
"research_agent_iteration_sub_step",
"research_agent_iteration_sub_step",
["parent_question_id"],
["id"],
ondelete="CASCADE",
)

View File

@@ -8,7 +8,7 @@ from sqlalchemy.ext.asyncio import create_async_engine
from sqlalchemy.schema import SchemaItem
from alembic import context
-from onyx.db.engine import build_connection_string
from onyx.db.engine.sql_engine import build_connection_string
from onyx.db.models import PublicBase
# this is the Alembic Config object, which provides

View File

@@ -0,0 +1,80 @@
"""add_db_readonly_user
Revision ID: 3b9f09038764
Revises: 3b45e0018bf1
Create Date: 2025-05-11 11:05:11.436977
"""
from sqlalchemy import text
from alembic import op
from onyx.configs.app_configs import DB_READONLY_PASSWORD
from onyx.configs.app_configs import DB_READONLY_USER
from shared_configs.configs import MULTI_TENANT
# revision identifiers, used by Alembic.
revision = "3b9f09038764"
down_revision = "3b45e0018bf1"
branch_labels = None
depends_on = None
def upgrade() -> None:
if MULTI_TENANT:
# Enable pg_trgm extension if not already enabled
op.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
# Create read-only db user here only in multi-tenant mode. For single-tenant mode,
# the user is created in the standard migration.
if not (DB_READONLY_USER and DB_READONLY_PASSWORD):
raise Exception("DB_READONLY_USER or DB_READONLY_PASSWORD is not set")
op.execute(
text(
f"""
DO $$
BEGIN
-- Check if the read-only user already exists
IF NOT EXISTS (SELECT FROM pg_catalog.pg_roles WHERE rolname = '{DB_READONLY_USER}') THEN
-- Create the read-only user with the specified password
EXECUTE format('CREATE USER %I WITH PASSWORD %L', '{DB_READONLY_USER}', '{DB_READONLY_PASSWORD}');
-- First revoke all privileges to ensure a clean slate
EXECUTE format('REVOKE ALL ON DATABASE %I FROM %I', current_database(), '{DB_READONLY_USER}');
-- Grant only the CONNECT privilege to allow the user to connect to the database
-- but not perform any operations without additional specific grants
EXECUTE format('GRANT CONNECT ON DATABASE %I TO %I', current_database(), '{DB_READONLY_USER}');
END IF;
END
$$;
"""
)
)
def downgrade() -> None:
if MULTI_TENANT:
# Drop the read-only db user here only in multi-tenant mode. For single-tenant
# mode, the user is dropped in the standard migration.
op.execute(
text(
f"""
DO $$
BEGIN
IF EXISTS (SELECT FROM pg_catalog.pg_roles WHERE rolname = '{DB_READONLY_USER}') THEN
-- First revoke all privileges from the database
EXECUTE format('REVOKE ALL ON DATABASE %I FROM %I', current_database(), '{DB_READONLY_USER}');
-- Then revoke all privileges from the public schema
EXECUTE format('REVOKE ALL ON SCHEMA public FROM %I', '{DB_READONLY_USER}');
-- Then drop the user
EXECUTE format('DROP USER %I', '{DB_READONLY_USER}');
END IF;
END
$$;
"""
)
)
op.execute(text("DROP EXTENSION IF EXISTS pg_trgm"))

View File

@@ -1,12 +1,10 @@
from sqlalchemy.orm import Session
from ee.onyx.db.external_perm import fetch_external_groups_for_user
from ee.onyx.db.external_perm import fetch_public_external_group_ids
from ee.onyx.db.user_group import fetch_user_groups_for_documents
from ee.onyx.db.user_group import fetch_user_groups_for_user
-from ee.onyx.external_permissions.post_query_censoring import (
-    DOC_SOURCE_TO_CHUNK_CENSORING_FUNCTION,
-)
-from ee.onyx.external_permissions.sync_params import DOC_PERMISSIONS_FUNC_MAP
from ee.onyx.external_permissions.sync_params import get_source_perm_sync_config
from onyx.access.access import (
_get_access_for_documents as get_access_for_documents_without_groups,
)
@@ -17,6 +15,10 @@ from onyx.access.utils import prefix_user_group
from onyx.db.document import get_document_sources
from onyx.db.document import get_documents_by_ids
from onyx.db.models import User
from onyx.utils.logger import setup_logger
logger = setup_logger()
def _get_access_for_document(
@@ -63,13 +65,21 @@ def _get_access_for_documents(
document_ids=document_ids,
)
all_public_ext_u_group_ids = set(fetch_public_external_group_ids(db_session))
access_map = {}
for document_id, non_ee_access in non_ee_access_dict.items():
document = doc_id_map[document_id]
source = doc_id_to_source_map.get(document_id)
if source is None:
logger.error(f"Document {document_id} has no source")
continue
perm_sync_config = get_source_perm_sync_config(source)
is_only_censored = (
-source in DOC_SOURCE_TO_CHUNK_CENSORING_FUNCTION
-and source not in DOC_PERMISSIONS_FUNC_MAP
perm_sync_config
and perm_sync_config.censoring_config is not None
and perm_sync_config.doc_sync_config is None
)
ext_u_emails = (
@@ -89,7 +99,10 @@ def _get_access_for_documents(
# If its censored, then it's public anywhere during the search and then permissions are
# applied after the search
is_public_anywhere = (
-document.is_public or non_ee_access.is_public or is_only_censored
document.is_public
or non_ee_access.is_public
or is_only_censored
or any(u_group in all_public_ext_u_group_ids for u_group in ext_u_groups)
)
# To avoid collisions of group namings between connectors, they need to be prefixed
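
A hedged restatement of the new is_only_censored predicate; the config shape is inferred from the attribute names in this diff, so treat it as illustrative:

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class PermSyncConfig:
    censoring_config: Optional[Any] = None
    doc_sync_config: Optional[Any] = None

def is_only_censored(cfg: Optional[PermSyncConfig]) -> bool:
    # Censoring is applied after the search, so such docs stay searchable by anyone
    return bool(cfg and cfg.censoring_config is not None and cfg.doc_sync_config is None)

assert is_only_censored(PermSyncConfig(censoring_config=object()))
assert not is_only_censored(PermSyncConfig(censoring_config=object(), doc_sync_config=object()))
assert not is_only_censored(None)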

Some files were not shown because too many files have changed in this diff.