Compare commits

34 Commits

Author SHA1 Message Date
pablonyx
03dfa0fcc0 update 2025-04-05 13:16:52 -07:00
pablonyx
0acd50b75d docx bugfix 2025-04-04 18:20:31 -07:00
pablonyx
c3c9a0e57c Docx parsing (#4455)
* looks okay

* k

* k

* k

* update values

* k

* quick fix
2025-04-04 23:36:43 +00:00
pablonyx
ef978aea97 Additional ACL Tests + Slackbot fix (#4430)
* try turning drive perm sync on

* try passing in env var

* add some logs

* Update pr-integration-tests.yml

* revert "Update pr-integration-tests.yml"

This reverts commit 76a44adbfe.

* Revert "add some logs"

This reverts commit ab9e6bcfb1.

* Revert "try passing in env var"

This reverts commit 9c0b6162ea.

* Revert "try turning drive perm sync on"

This reverts commit 2d35f61f42.

* try slack connector

* k

* update

* remove logs

* remove more logs

* nit

* k

* k

* address nits

* run test with additional logs

* Revert "run test with additional logs"

This reverts commit 1397a2c4a0.

* Revert "address nits"

This reverts commit d5e24b019d.
2025-04-04 22:00:17 +00:00
rkuo-danswer
15ab0586df handle gong api race condition (#4457)
* working around a gong race condition in their api

* add back gong basic test

* formatting

* add the call index

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-04 19:33:47 +00:00
rkuo-danswer
839c8611b7 Bugfix/salesforce (#4335)
* add some gc

* small refactoring for temp directories

* WIP

* add some gc collects and size calculations

* un-xfail

* fix salesforce test

* loose check for number of docs

* adjust test again

* cleanup

* nuke directory param, remove using sqlite db to cache email / id mappings

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-04 16:21:34 +00:00
joachim-danswer
68f9f157a6 Adding research topics for better search context (#4448)
* research topics addition

* allow for question to overwrite research area
2025-04-04 09:53:39 -07:00
SubashMohan
9dd56a5c80 Enhance Highspot connector with error handling and add unit tests (#4454)
* Enhance Highspot connector with error handling and add unit tests for poll_source functionality

* Fix file extension validation logic to allow either plain text or document format
2025-04-04 09:53:16 -07:00
pablonyx
842a73a242 Mock connector fix (#4446) 2025-04-04 09:26:10 -07:00
Weves
c04c1ea31b Fix onyx_config.jsonl 2025-04-03 22:44:56 -07:00
Chris Weaver
2380c2266c Infra and Deployment for ECS Fargate (#4449)
* Infra and Deployment for ECS Fargate
---------

Co-authored-by: jpb80 <jordan.buttkevitz@gmail.com>
2025-04-03 22:43:56 -07:00
pablonyx
b02af9b280 Div Con (#4442)
* base setup

* Improvements + time boxing

* time box fix

* mypy fix

* EL Comments

* CW comments

* date awareness

---------

Co-authored-by: joachim-danswer <joachim@danswer.ai>
2025-04-04 00:52:00 +00:00
rkuo-danswer
42938dcf62 Bugfix/gong tweaks (#4444)
* gong debugging

* add retries via class level session, add debugging

* add gong connector test

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-03 22:22:45 +00:00
pablonyx
93886f0e2c Assistant Prompt length + client side (#4433) 2025-04-03 11:26:53 -07:00
rkuo-danswer
8c3a953b7a add prometheus metrics endpoints via helper package (#4436)
* add prometheus metrics endpoints via helper package

* model server specific requirements

* mark as public endpoint

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-03 16:52:05 +00:00
evan-danswer
54b883d0ca fix large docs selected in chat pruning (#4412)
* fix large docs selected in chat pruning

* better approach to length restriction

* comments

* comments

* fix unit tests and minor pruning bug

* remove prints
2025-04-03 15:48:10 +00:00
pablonyx
91faac5447 minor fix (#4435) 2025-04-03 15:00:27 +00:00
Chris Weaver
1d8f9fc39d Fix weird re-index state (#4439)
* Fix weird re-index state

* Address rkuo's comments
2025-04-03 02:16:34 +00:00
Weves
9390de21e5 More logging on confluence space permissions 2025-04-02 20:01:38 -07:00
rkuo-danswer
3a33433fc9 unit tests for chunk censoring (#4434)
* unit tests for chunk censoring

* type hints for mypy

* pytestification

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-03 01:28:54 +00:00
Chris Weaver
c4865d57b1 Fix tons of users w/o drive access causing timeouts (#4437) 2025-04-03 00:01:05 +00:00
rkuo-danswer
81d04db08f Feature/request id middleware 2 (#4427)
* stubbing out request id

* passthru or create request id's in api and model server

* add onyx request id

* get request id logging into uvicorn

* no logs

* change prefixes

* fix comment

* docker image needs specific shared files

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-02 22:30:03 +00:00
rkuo-danswer
d50a17db21 add filter unit tests (#4421)
* add filter unit tests

* fix tests

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-02 20:26:25 +00:00
pablonyx
dc5a1e8fd0 add more flexible vision support check (#4429) 2025-04-02 18:11:33 +00:00
pablonyx
c0b3681650 update (#4428) 2025-04-02 18:09:44 +00:00
Chris Weaver
7ec04484d4 Another fix for Salesforce perm sync (#4432)
* Another fix for Salesforce perm sync

* typing
2025-04-02 11:08:40 -07:00
Weves
1cf966ecc1 Fix Salesforce perm sync 2025-04-02 10:47:26 -07:00
rkuo-danswer
8a8526dbbb harden join function (#4424)
* harden join function

* remove log spam

* use time.monotonic

* add pid logging

* client only celery app

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-02 01:04:00 -07:00
Weves
be20586ba1 Add retries for confluence calls 2025-04-01 23:00:37 -07:00
Weves
a314462d1e Fix migrations 2025-04-01 21:48:32 -07:00
rkuo-danswer
155f53c3d7 Revert "Add user invitation test (#4161)" (#4422)
This reverts commit 806de92feb.

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2025-04-01 19:55:04 -07:00
pablonyx
7c027df186 Fix cc pair doc deletion (#4420) 2025-04-01 18:44:15 -07:00
pablonyx
0a5db96026 update (#4415) 2025-04-02 00:42:42 +00:00
joachim-danswer
daef985b02 Simpler approach (#4414) 2025-04-01 16:52:59 -07:00
138 changed files with 8371 additions and 981 deletions

View File

@@ -23,6 +23,10 @@ env:
# Jira
JIRA_USER_EMAIL: ${{ secrets.JIRA_USER_EMAIL }}
JIRA_API_TOKEN: ${{ secrets.JIRA_API_TOKEN }}
GONG_ACCESS_KEY: ${{ secrets.GONG_ACCESS_KEY }}
GONG_ACCESS_KEY_SECRET: ${{ secrets.GONG_ACCESS_KEY_SECRET }}
# Google
GOOGLE_DRIVE_SERVICE_ACCOUNT_JSON_STR: ${{ secrets.GOOGLE_DRIVE_SERVICE_ACCOUNT_JSON_STR }}
GOOGLE_DRIVE_OAUTH_CREDENTIALS_JSON_STR_TEST_USER_1: ${{ secrets.GOOGLE_DRIVE_OAUTH_CREDENTIALS_JSON_STR_TEST_USER_1 }}

View File

@@ -30,30 +30,26 @@ Keep knowledge and access controls sync-ed across over 40 connectors like Google
Create custom AI agents with unique prompts, knowledge, and actions that the agents can take.
Onyx can be deployed securely anywhere and for any scale - on a laptop, on-premise, or to cloud.
<h3>Feature Highlights</h3>
**Deep research over your team's knowledge:**
https://private-user-images.githubusercontent.com/32520769/414509312-48392e83-95d0-4fb5-8650-a396e05e0a32.mp4?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk5Mjg2MzYsIm5iZiI6MTczOTkyODMzNiwicGF0aCI6Ii8zMjUyMDc2OS80MTQ1MDkzMTItNDgzOTJlODMtOTVkMC00ZmI1LTg2NTAtYTM5NmUwNWUwYTMyLm1wND9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE5VDAxMjUzNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWFhMzk5Njg2Y2Y5YjFmNDNiYTQ2YzM5ZTg5YWJiYTU2NWMyY2YwNmUyODE2NWUxMDRiMWQxZWJmODI4YTA0MTUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.a9D8A0sgKE9AoaoE-mfFbJ6_OKYeqaf7TZ4Han2JfW8
**Use Onyx as a secure AI Chat with any LLM:**
![Onyx Chat Silent Demo](https://github.com/onyx-dot-app/onyx/releases/download/v0.21.1/OnyxChatSilentDemo.gif)
**Easily set up connectors to your apps:**
![Onyx Connector Silent Demo](https://github.com/onyx-dot-app/onyx/releases/download/v0.21.1/OnyxConnectorSilentDemo.gif)
**Access Onyx where your team already works:**
![Onyx Bot Demo](https://github.com/onyx-dot-app/onyx/releases/download/v0.21.1/OnyxBot.png)
## Deployment
**To try it out for free and get started in seconds, check out [Onyx Cloud](https://cloud.onyx.app/signup)**.
Onyx can also be run locally (even on a laptop) or deployed on a virtual machine with a single
@@ -62,23 +58,23 @@ Onyx can also be run locally (even on a laptop) or deployed on a virtual machine
We also have built-in support for high-availability/scalable deployment on Kubernetes.
References [here](https://github.com/onyx-dot-app/onyx/tree/main/deployment).
## 🔍 Other Notable Benefits of Onyx
- Custom deep learning models for indexing and inference time, only through Onyx + learning from user feedback.
- Flexible security features like SSO (OIDC/SAML/OAuth2), RBAC, encryption of credentials, etc.
- Knowledge curation features like document-sets, query history, usage analytics, etc.
- Scalable deployment options tested up to many tens of thousands of users and hundreds of millions of documents.
## 🚧 Roadmap
- New methods in information retrieval (StructRAG, LightGraphRAG, etc.)
- Personalized Search
- Organizational understanding and ability to locate and suggest experts from your team.
- Code Search
- SQL and Structured Query Language
## 🔌 Connectors
Keep knowledge and access in sync across 40+ connectors:
- Google Drive
@@ -99,19 +95,65 @@ Keep knowledge and access up to sync across 40+ connectors:
See the full list [here](https://docs.onyx.app/connectors).
## 📚 Licensing
There are two editions of Onyx:
- Onyx Community Edition (CE) is available freely under the MIT Expat license. Simply follow the Deployment guide above.
- Onyx Enterprise Edition (EE) includes extra features that are primarily useful for larger organizations.
For feature details, check out [our website](https://www.onyx.app/pricing).
To try the Onyx Enterprise Edition:
1. Checkout [Onyx Cloud](https://cloud.onyx.app/signup).
2. For self-hosting the Enterprise Edition, contact us at [founders@onyx.app](mailto:founders@onyx.app) or book a call with us on our [Cal](https://cal.com/team/onyx/founders).
## 💡 Contributing
Looking to contribute? Please check out the [Contribution Guide](CONTRIBUTING.md) for more details.
# YC Company Twitter Scraper
A script that scrapes YC company pages and extracts Twitter/X.com links.
## Requirements
- Python 3.7+
- Playwright
## Installation
1. Install the required packages:
```
pip install -r requirements.txt
```
2. Install Playwright browsers:
```
playwright install
```
## Usage
Run the script with default settings:
```
python scrape_yc_twitter.py
```
This will scrape the YC companies from recent batches (W23, S23, S24, F24, S22, W22) and save the Twitter links to `twitter_links.txt`.
### Custom URL and Output
```
python scrape_yc_twitter.py --url "https://www.ycombinator.com/companies?batch=W24" --output "w24_twitter.txt"
```
## How it works
1. Navigates to the specified YC companies page
2. Scrolls down to load all company cards
3. Extracts links to individual company pages
4. Visits each company page and extracts Twitter/X.com links
5. Saves the results to a text file

YC_SCRAPER_README.md (new file, 45 lines)
View File

@@ -0,0 +1,45 @@
# YC Company Twitter Scraper
A script that scrapes YC company pages and extracts Twitter/X.com links.
## Requirements
- Python 3.7+
- Playwright
## Installation
1. Install the required packages:
```
pip install -r requirements.txt
```
2. Install Playwright browsers:
```
playwright install
```
## Usage
Run the script with default settings:
```
python scrape_yc_twitter.py
```
This will scrape the YC companies from recent batches (W23, S23, S24, F24, S22, W22) and save the Twitter links to `twitter_links.txt`.
### Custom URL and Output
```
python scrape_yc_twitter.py --url "https://www.ycombinator.com/companies?batch=W24" --output "w24_twitter.txt"
```
## How it works
1. Navigates to the specified YC companies page
2. Scrolls down to load all company cards
3. Extracts links to individual company pages
4. Visits each company page and extracts Twitter/X.com links
5. Saves the results to a text file
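A minimal sketch of the five steps above, assuming Playwright's synchronous API; the selectors, scroll count, and function name are illustrative guesses, since the actual `scrape_yc_twitter.py` is not shown in this diff.

```python
# Hypothetical sketch only; selectors and scrolling behavior are assumptions,
# not taken from the real scrape_yc_twitter.py.
from playwright.sync_api import sync_playwright


def scrape_twitter_links(url: str, output: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)

        # Scroll to trigger lazy loading of the company cards
        for _ in range(20):
            page.mouse.wheel(0, 2000)
            page.wait_for_timeout(500)

        hrefs = [
            a.get_attribute("href")
            for a in page.query_selector_all("a[href^='/companies/']")
        ]
        company_links = [h for h in hrefs if h]

        twitter_links: list[str] = []
        for link in company_links:
            page.goto(f"https://www.ycombinator.com{link}")
            for a in page.query_selector_all("a[href*='twitter.com'], a[href*='x.com']"):
                href = a.get_attribute("href")
                if href:
                    twitter_links.append(href)
        browser.close()

    with open(output, "w") as f:
        f.write("\n".join(twitter_links))


if __name__ == "__main__":
    scrape_twitter_links(
        "https://www.ycombinator.com/companies?batch=W24", "w24_twitter.txt"
    )
```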

View File

@@ -46,6 +46,7 @@ WORKDIR /app
# Utils used by model server
COPY ./onyx/utils/logger.py /app/onyx/utils/logger.py
COPY ./onyx/utils/middleware.py /app/onyx/utils/middleware.py
# Place to fetch version information
COPY ./onyx/__init__.py /app/onyx/__init__.py

View File

@@ -0,0 +1,50 @@
"""update prompt length
Revision ID: 4794bc13e484
Revises: f7505c5b0284
Create Date: 2025-04-02 11:26:36.180328
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision = "4794bc13e484"
down_revision = "f7505c5b0284"
branch_labels = None
depends_on = None
def upgrade() -> None:
op.alter_column(
"prompt",
"system_prompt",
existing_type=sa.TEXT(),
type_=sa.String(length=5000000),
existing_nullable=False,
)
op.alter_column(
"prompt",
"task_prompt",
existing_type=sa.TEXT(),
type_=sa.String(length=5000000),
existing_nullable=False,
)
def downgrade() -> None:
op.alter_column(
"prompt",
"system_prompt",
existing_type=sa.String(length=5000000),
type_=sa.TEXT(),
existing_nullable=False,
)
op.alter_column(
"prompt",
"task_prompt",
existing_type=sa.String(length=5000000),
type_=sa.TEXT(),
existing_nullable=False,
)

View File

@@ -0,0 +1,50 @@
"""add prompt length limit
Revision ID: f71470ba9274
Revises: 6a804aeb4830
Create Date: 2025-04-01 15:07:14.977435
"""
# revision identifiers, used by Alembic.
revision = "f71470ba9274"
down_revision = "6a804aeb4830"
branch_labels = None
depends_on = None
def upgrade() -> None:
# op.alter_column(
# "prompt",
# "system_prompt",
# existing_type=sa.TEXT(),
# type_=sa.String(length=8000),
# existing_nullable=False,
# )
# op.alter_column(
# "prompt",
# "task_prompt",
# existing_type=sa.TEXT(),
# type_=sa.String(length=8000),
# existing_nullable=False,
# )
pass
def downgrade() -> None:
# op.alter_column(
# "prompt",
# "system_prompt",
# existing_type=sa.String(length=8000),
# type_=sa.TEXT(),
# existing_nullable=False,
# )
# op.alter_column(
# "prompt",
# "task_prompt",
# existing_type=sa.String(length=8000),
# type_=sa.TEXT(),
# existing_nullable=False,
# )
pass

View File

@@ -0,0 +1,77 @@
"""updated constraints for ccpairs
Revision ID: f7505c5b0284
Revises: f71470ba9274
Create Date: 2025-04-01 17:50:42.504818
"""
from alembic import op
# revision identifiers, used by Alembic.
revision = "f7505c5b0284"
down_revision = "f71470ba9274"
branch_labels = None
depends_on = None
def upgrade() -> None:
# 1) Drop the old foreign-key constraints
op.drop_constraint(
"document_by_connector_credential_pair_connector_id_fkey",
"document_by_connector_credential_pair",
type_="foreignkey",
)
op.drop_constraint(
"document_by_connector_credential_pair_credential_id_fkey",
"document_by_connector_credential_pair",
type_="foreignkey",
)
# 2) Re-add them with ondelete='CASCADE'
op.create_foreign_key(
"document_by_connector_credential_pair_connector_id_fkey",
source_table="document_by_connector_credential_pair",
referent_table="connector",
local_cols=["connector_id"],
remote_cols=["id"],
ondelete="CASCADE",
)
op.create_foreign_key(
"document_by_connector_credential_pair_credential_id_fkey",
source_table="document_by_connector_credential_pair",
referent_table="credential",
local_cols=["credential_id"],
remote_cols=["id"],
ondelete="CASCADE",
)
def downgrade() -> None:
# Reverse the changes for rollback
op.drop_constraint(
"document_by_connector_credential_pair_connector_id_fkey",
"document_by_connector_credential_pair",
type_="foreignkey",
)
op.drop_constraint(
"document_by_connector_credential_pair_credential_id_fkey",
"document_by_connector_credential_pair",
type_="foreignkey",
)
# Recreate without CASCADE
op.create_foreign_key(
"document_by_connector_credential_pair_connector_id_fkey",
"document_by_connector_credential_pair",
"connector",
["connector_id"],
["id"],
)
op.create_foreign_key(
"document_by_connector_credential_pair_credential_id_fkey",
"document_by_connector_credential_pair",
"credential",
["credential_id"],
["id"],
)

View File

@@ -159,6 +159,9 @@ def _get_space_permissions(
# Stores the permissions for each space
space_permissions_by_space_key[space_key] = space_permissions
logger.info(
f"Found space permissions for space '{space_key}': {space_permissions}"
)
return space_permissions_by_space_key

View File

@@ -55,7 +55,7 @@ def _post_query_chunk_censoring(
# if user is None, permissions are not enforced
return chunks
chunks_to_keep = []
final_chunk_dict: dict[str, InferenceChunk] = {}
chunks_to_process: dict[DocumentSource, list[InferenceChunk]] = {}
sources_to_censor = _get_all_censoring_enabled_sources()
@@ -64,7 +64,7 @@ def _post_query_chunk_censoring(
if chunk.source_type in sources_to_censor:
chunks_to_process.setdefault(chunk.source_type, []).append(chunk)
else:
chunks_to_keep.append(chunk)
final_chunk_dict[chunk.unique_id] = chunk
# For each source, filter out the chunks using the permission
# check function for that source
@@ -79,6 +79,16 @@ def _post_query_chunk_censoring(
f" chunks for this source and continuing: {e}"
)
continue
chunks_to_keep.extend(censored_chunks)
return chunks_to_keep
for censored_chunk in censored_chunks:
final_chunk_dict[censored_chunk.unique_id] = censored_chunk
# IMPORTANT: make sure to retain the same ordering as the original `chunks` passed in
final_chunk_list: list[InferenceChunk] = []
for chunk in chunks:
# only if the chunk is in the final censored chunks, add it to the final list
# if it is missing, that means it was intentionally left out
if chunk.unique_id in final_chunk_dict:
final_chunk_list.append(final_chunk_dict[chunk.unique_id])
return final_chunk_list
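For clarity, the change above swaps an append/extend accumulator for a dict keyed by `unique_id`, then re-walks the original `chunks` list so censoring can never reorder results. A self-contained sketch of that pattern with stand-in types (not the real Onyx models):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    unique_id: str
    source_type: str


def censor(chunks: list[Chunk]) -> list[Chunk]:
    # stand-in for a per-source permission check that may drop chunks
    return [c for c in chunks if not c.unique_id.endswith("-secret")]


def filter_preserving_order(chunks: list[Chunk], sources_to_censor: set[str]) -> list[Chunk]:
    kept: dict[str, Chunk] = {}
    to_process: dict[str, list[Chunk]] = {}
    for chunk in chunks:
        if chunk.source_type in sources_to_censor:
            to_process.setdefault(chunk.source_type, []).append(chunk)
        else:
            kept[chunk.unique_id] = chunk
    for source_chunks in to_process.values():
        for survivor in censor(source_chunks):
            kept[survivor.unique_id] = survivor
    # Re-walk the original list so the output order matches the input exactly
    return [kept[c.unique_id] for c in chunks if c.unique_id in kept]


chunks = [Chunk("a", "slack"), Chunk("b-secret", "slack"), Chunk("c", "web")]
print(filter_preserving_order(chunks, {"slack"}))  # keeps 'a' and 'c', in the original order
```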

View File

@@ -51,9 +51,9 @@ def _get_objects_access_for_user_email_from_salesforce(
# This is cached in the function so the first query takes an extra 0.1-0.3 seconds
# but subsequent queries by the same user are essentially instant
start_time = time.time()
start_time = time.monotonic()
user_id = get_salesforce_user_id_from_email(salesforce_client, user_email)
end_time = time.time()
end_time = time.monotonic()
logger.info(
f"Time taken to get Salesforce user ID: {end_time - start_time} seconds"
)
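The switch from `time.time()` to `time.monotonic()` matters because the monotonic clock cannot jump when the system clock is adjusted, so elapsed-time logs stay trustworthy. A tiny illustration:

```python
import time


def do_work() -> None:
    time.sleep(0.1)  # stand-in for the Salesforce user-ID lookup


start = time.monotonic()
do_work()
elapsed = time.monotonic() - start  # monotonic clocks never go backwards
print(f"Time taken: {elapsed:.3f} seconds")
```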

View File

@@ -1,10 +1,6 @@
from simple_salesforce import Salesforce
from sqlalchemy.orm import Session
from onyx.connectors.salesforce.sqlite_functions import get_user_id_by_email
from onyx.connectors.salesforce.sqlite_functions import init_db
from onyx.connectors.salesforce.sqlite_functions import NULL_ID_STRING
from onyx.connectors.salesforce.sqlite_functions import update_email_to_id_table
from onyx.db.connector_credential_pair import get_connector_credential_pair_from_id
from onyx.db.document import get_cc_pairs_for_document
from onyx.utils.logger import setup_logger
@@ -28,6 +24,8 @@ def get_any_salesforce_client_for_doc_id(
E.g. there are 2 different credential sets for 2 different salesforce cc_pairs
but only one has the permissions to access the permissions needed for the query.
"""
# NOTE: this global seems very very bad
global _ANY_SALESFORCE_CLIENT
if _ANY_SALESFORCE_CLIENT is None:
cc_pairs = get_cc_pairs_for_document(db_session, doc_id)
@@ -42,11 +40,18 @@ def get_any_salesforce_client_for_doc_id(
def _query_salesforce_user_id(sf_client: Salesforce, user_email: str) -> str | None:
query = f"SELECT Id FROM User WHERE Email = '{user_email}'"
query = f"SELECT Id FROM User WHERE Username = '{user_email}' AND IsActive = true"
result = sf_client.query(query)
if len(result["records"]) == 0:
return None
return result["records"][0]["Id"]
if len(result["records"]) > 0:
return result["records"][0]["Id"]
# try emails
query = f"SELECT Id FROM User WHERE Email = '{user_email}' AND IsActive = true"
result = sf_client.query(query)
if len(result["records"]) > 0:
return result["records"][0]["Id"]
return None
# This contains only the user_ids that we have found in Salesforce.
@@ -77,35 +82,21 @@ def get_salesforce_user_id_from_email(
salesforce database. (Around 0.1-0.3 seconds)
If it's cached or stored in the local salesforce database, it's fast (<0.001 seconds).
"""
# NOTE: this global seems bad
global _CACHED_SF_EMAIL_TO_ID_MAP
if user_email in _CACHED_SF_EMAIL_TO_ID_MAP:
if _CACHED_SF_EMAIL_TO_ID_MAP[user_email] is not None:
return _CACHED_SF_EMAIL_TO_ID_MAP[user_email]
db_exists = True
try:
# Check if the user is already in the database
user_id = get_user_id_by_email(user_email)
except Exception:
init_db()
try:
user_id = get_user_id_by_email(user_email)
except Exception as e:
logger.error(f"Error checking if user is in database: {e}")
user_id = None
db_exists = False
# some caching via sqlite existed here before ... check history if interested
# ...query Salesforce and store the result in the database
user_id = _query_salesforce_user_id(sf_client, user_email)
# If no entry is found in the database (indicated by user_id being None)...
if user_id is None:
# ...query Salesforce and store the result in the database
user_id = _query_salesforce_user_id(sf_client, user_email)
if db_exists:
update_email_to_id_table(user_email, user_id)
return user_id
elif user_id is None:
return None
elif user_id == NULL_ID_STRING:
return None
# If the found user_id is real, cache it
_CACHED_SF_EMAIL_TO_ID_MAP[user_email] = user_id
return user_id

View File

@@ -5,12 +5,14 @@ from slack_sdk import WebClient
from ee.onyx.external_permissions.slack.utils import fetch_user_id_to_email_map
from onyx.access.models import DocExternalAccess
from onyx.access.models import ExternalAccess
from onyx.connectors.credentials_provider import OnyxDBCredentialsProvider
from onyx.connectors.slack.connector import get_channels
from onyx.connectors.slack.connector import make_paginated_slack_api_call_w_retries
from onyx.connectors.slack.connector import SlackConnector
from onyx.db.models import ConnectorCredentialPair
from onyx.indexing.indexing_heartbeat import IndexingHeartbeatInterface
from onyx.utils.logger import setup_logger
from shared_configs.contextvars import get_current_tenant_id
logger = setup_logger()
@@ -101,7 +103,12 @@ def _get_slack_document_access(
callback: IndexingHeartbeatInterface | None,
) -> Generator[DocExternalAccess, None, None]:
slack_connector = SlackConnector(**cc_pair.connector.connector_specific_config)
slack_connector.load_credentials(cc_pair.credential.credential_json)
# Use credentials provider instead of directly loading credentials
provider = OnyxDBCredentialsProvider(
get_current_tenant_id(), "slack", cc_pair.credential.id
)
slack_connector.set_credentials_provider(provider)
slim_doc_generator = slack_connector.retrieve_all_slim_documents(callback=callback)

View File

@@ -51,6 +51,7 @@ def _get_slack_group_members_email(
def slack_group_sync(
tenant_id: str,
cc_pair: ConnectorCredentialPair,
) -> list[ExternalUserGroup]:
slack_client = WebClient(

View File

@@ -15,6 +15,7 @@ from ee.onyx.external_permissions.post_query_censoring import (
DOC_SOURCE_TO_CHUNK_CENSORING_FUNCTION,
)
from ee.onyx.external_permissions.slack.doc_sync import slack_doc_sync
from ee.onyx.external_permissions.slack.group_sync import slack_group_sync
from onyx.access.models import DocExternalAccess
from onyx.configs.constants import DocumentSource
from onyx.db.models import ConnectorCredentialPair
@@ -56,6 +57,7 @@ DOC_PERMISSIONS_FUNC_MAP: dict[DocumentSource, DocSyncFuncType] = {
GROUP_PERMISSIONS_FUNC_MAP: dict[DocumentSource, GroupSyncFuncType] = {
DocumentSource.GOOGLE_DRIVE: gdrive_group_sync,
DocumentSource.CONFLUENCE: confluence_group_sync,
DocumentSource.SLACK: slack_group_sync,
}

View File

@@ -1,3 +1,4 @@
import logging
import os
import shutil
from collections.abc import AsyncGenerator
@@ -8,6 +9,7 @@ import sentry_sdk
import torch
import uvicorn
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator
from sentry_sdk.integrations.fastapi import FastApiIntegration
from sentry_sdk.integrations.starlette import StarletteIntegration
from transformers import logging as transformer_logging # type:ignore
@@ -20,6 +22,8 @@ from model_server.management_endpoints import router as management_router
from model_server.utils import get_gpu_type
from onyx import __version__
from onyx.utils.logger import setup_logger
from onyx.utils.logger import setup_uvicorn_logger
from onyx.utils.middleware import add_onyx_request_id_middleware
from shared_configs.configs import INDEXING_ONLY
from shared_configs.configs import MIN_THREADS_ML_MODELS
from shared_configs.configs import MODEL_SERVER_ALLOWED_HOST
@@ -36,6 +40,12 @@ transformer_logging.set_verbosity_error()
logger = setup_logger()
file_handlers = [
h for h in logger.logger.handlers if isinstance(h, logging.FileHandler)
]
setup_uvicorn_logger(shared_file_handlers=file_handlers)
def _move_files_recursively(source: Path, dest: Path, overwrite: bool = False) -> None:
"""
@@ -112,6 +122,15 @@ def get_model_app() -> FastAPI:
application.include_router(encoders_router)
application.include_router(custom_models_router)
request_id_prefix = "INF"
if INDEXING_ONLY:
request_id_prefix = "IDX"
add_onyx_request_id_middleware(application, request_id_prefix, logger)
# Initialize and instrument the app
Instrumentator().instrument(application).expose(application)
return application
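For reference, `prometheus-fastapi-instrumentator` wires default HTTP request metrics into a FastAPI app and serves them on `/metrics` with exactly this two-call pattern. A minimal standalone example (the route name is illustrative):

```python
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()


@app.get("/ping")
def ping() -> dict[str, str]:
    return {"status": "ok"}


# instrument() records default HTTP metrics; expose() mounts GET /metrics
Instrumentator().instrument(app).expose(app)
```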

View File

@@ -15,6 +15,22 @@ class ExternalAccess:
# Whether the document is public in the external system or Onyx
is_public: bool
def __str__(self) -> str:
"""Prevent extremely long logs"""
def truncate_set(s: set[str], max_len: int = 100) -> str:
s_str = str(s)
if len(s_str) > max_len:
return f"{s_str[:max_len]}... ({len(s)} items)"
return s_str
return (
f"ExternalAccess("
f"external_user_emails={truncate_set(self.external_user_emails)}, "
f"external_user_group_ids={truncate_set(self.external_user_group_ids)}, "
f"is_public={self.is_public})"
)
@dataclass(frozen=True)
class DocExternalAccess:

View File

@@ -0,0 +1,62 @@
from collections.abc import Hashable
from typing import cast
from langchain_core.runnables.config import RunnableConfig
from langgraph.types import Send
from onyx.agents.agent_search.dc_search_analysis.states import ObjectInformationInput
from onyx.agents.agent_search.dc_search_analysis.states import (
ObjectResearchInformationUpdate,
)
from onyx.agents.agent_search.dc_search_analysis.states import ObjectSourceInput
from onyx.agents.agent_search.dc_search_analysis.states import (
SearchSourcesObjectsUpdate,
)
from onyx.agents.agent_search.models import GraphConfig
def parallel_object_source_research_edge(
state: SearchSourcesObjectsUpdate, config: RunnableConfig
) -> list[Send | Hashable]:
"""
LangGraph edge to parallelize the research for an individual object and source
"""
search_objects = state.analysis_objects
search_sources = state.analysis_sources
object_source_combinations = [
(object, source) for object in search_objects for source in search_sources
]
return [
Send(
"research_object_source",
ObjectSourceInput(
object_source_combination=object_source_combination,
log_messages=[],
),
)
for object_source_combination in object_source_combinations
]
def parallel_object_research_consolidation_edge(
state: ObjectResearchInformationUpdate, config: RunnableConfig
) -> list[Send | Hashable]:
"""
LangGraph edge to parallelize the research for an individual object and source
"""
cast(GraphConfig, config["metadata"]["config"])
object_research_information_results = state.object_research_information_results
return [
Send(
"consolidate_object_research",
ObjectInformationInput(
object_information=object_information,
log_messages=[],
),
)
for object_information in object_research_information_results
]
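Both edges use LangGraph's fan-out mechanism: a conditional edge returns a list of `Send` objects and the runtime invokes the target node once per payload, merging results back through the state's reducers. A toy, self-contained version of that map-reduce pattern (node and field names are invented, not the DivCon ones):

```python
import operator
from typing import Annotated, TypedDict

from langgraph.graph import END, START, StateGraph
from langgraph.types import Send


class State(TypedDict):
    items: list[str]
    results: Annotated[list[str], operator.add]  # fan-in: parallel updates are appended


class WorkerState(TypedDict):
    item: str


def fan_out(state: State) -> list[Send]:
    # One Send per item: the "research" node runs once per payload, in parallel
    return [Send("research", {"item": item}) for item in state["items"]]


def research(state: WorkerState) -> dict:
    return {"results": [f"researched {state['item']}"]}


def consolidate(state: State) -> dict:
    return {"results": [f"consolidated {len(state['results'])} findings"]}


graph = StateGraph(State)
graph.add_node("research", research)
graph.add_node("consolidate", consolidate)
graph.add_conditional_edges(START, fan_out, ["research"])
graph.add_edge("research", "consolidate")
graph.add_edge("consolidate", END)

print(graph.compile().invoke({"items": ["alpha", "beta"], "results": []}))
```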

View File

@@ -0,0 +1,103 @@
from langgraph.graph import END
from langgraph.graph import START
from langgraph.graph import StateGraph
from onyx.agents.agent_search.dc_search_analysis.edges import (
parallel_object_research_consolidation_edge,
)
from onyx.agents.agent_search.dc_search_analysis.edges import (
parallel_object_source_research_edge,
)
from onyx.agents.agent_search.dc_search_analysis.nodes.a1_search_objects import (
search_objects,
)
from onyx.agents.agent_search.dc_search_analysis.nodes.a2_research_object_source import (
research_object_source,
)
from onyx.agents.agent_search.dc_search_analysis.nodes.a3_structure_research_by_object import (
structure_research_by_object,
)
from onyx.agents.agent_search.dc_search_analysis.nodes.a4_consolidate_object_research import (
consolidate_object_research,
)
from onyx.agents.agent_search.dc_search_analysis.nodes.a5_consolidate_research import (
consolidate_research,
)
from onyx.agents.agent_search.dc_search_analysis.states import MainInput
from onyx.agents.agent_search.dc_search_analysis.states import MainState
from onyx.utils.logger import setup_logger
logger = setup_logger()
test_mode = False
def divide_and_conquer_graph_builder(test_mode: bool = False) -> StateGraph:
"""
LangGraph graph builder for the knowledge graph search process.
"""
graph = StateGraph(
state_schema=MainState,
input=MainInput,
)
### Add nodes ###
graph.add_node(
"search_objects",
search_objects,
)
graph.add_node(
"structure_research_by_source",
structure_research_by_object,
)
graph.add_node(
"research_object_source",
research_object_source,
)
graph.add_node(
"consolidate_object_research",
consolidate_object_research,
)
graph.add_node(
"consolidate_research",
consolidate_research,
)
### Add edges ###
graph.add_edge(start_key=START, end_key="search_objects")
graph.add_conditional_edges(
source="search_objects",
path=parallel_object_source_research_edge,
path_map=["research_object_source"],
)
graph.add_edge(
start_key="research_object_source",
end_key="structure_research_by_source",
)
graph.add_conditional_edges(
source="structure_research_by_source",
path=parallel_object_research_consolidation_edge,
path_map=["consolidate_object_research"],
)
graph.add_edge(
start_key="consolidate_object_research",
end_key="consolidate_research",
)
graph.add_edge(
start_key="consolidate_research",
end_key=END,
)
return graph

View File

@@ -0,0 +1,159 @@
from typing import cast
from langchain_core.messages import HumanMessage
from langchain_core.runnables import RunnableConfig
from langgraph.types import StreamWriter
from onyx.agents.agent_search.dc_search_analysis.ops import extract_section
from onyx.agents.agent_search.dc_search_analysis.ops import research
from onyx.agents.agent_search.dc_search_analysis.states import MainState
from onyx.agents.agent_search.dc_search_analysis.states import (
SearchSourcesObjectsUpdate,
)
from onyx.agents.agent_search.models import GraphConfig
from onyx.agents.agent_search.shared_graph_utils.agent_prompt_ops import (
trim_prompt_piece,
)
from onyx.agents.agent_search.shared_graph_utils.utils import write_custom_event
from onyx.chat.models import AgentAnswerPiece
from onyx.configs.constants import DocumentSource
from onyx.prompts.agents.dc_prompts import DC_OBJECT_NO_BASE_DATA_EXTRACTION_PROMPT
from onyx.prompts.agents.dc_prompts import DC_OBJECT_SEPARATOR
from onyx.prompts.agents.dc_prompts import DC_OBJECT_WITH_BASE_DATA_EXTRACTION_PROMPT
from onyx.utils.logger import setup_logger
from onyx.utils.threadpool_concurrency import run_with_timeout
logger = setup_logger()
def search_objects(
state: MainState, config: RunnableConfig, writer: StreamWriter = lambda _: None
) -> SearchSourcesObjectsUpdate:
"""
LangGraph node to start the agentic search process.
"""
graph_config = cast(GraphConfig, config["metadata"]["config"])
question = graph_config.inputs.search_request.query
search_tool = graph_config.tooling.search_tool
if search_tool is None or graph_config.inputs.search_request.persona is None:
raise ValueError("Search tool and persona must be provided for DivCon search")
try:
instructions = graph_config.inputs.search_request.persona.prompts[
0
].system_prompt
agent_1_instructions = extract_section(
instructions, "Agent Step 1:", "Agent Step 2:"
)
if agent_1_instructions is None:
raise ValueError("Agent 1 instructions not found")
agent_1_base_data = extract_section(instructions, "|Start Data|", "|End Data|")
agent_1_task = extract_section(
agent_1_instructions, "Task:", "Independent Research Sources:"
)
if agent_1_task is None:
raise ValueError("Agent 1 task not found")
agent_1_independent_sources_str = extract_section(
agent_1_instructions, "Independent Research Sources:", "Output Objective:"
)
if agent_1_independent_sources_str is None:
raise ValueError("Agent 1 Independent Research Sources not found")
document_sources = [
DocumentSource(x.strip().lower())
for x in agent_1_independent_sources_str.split(DC_OBJECT_SEPARATOR)
]
agent_1_output_objective = extract_section(
agent_1_instructions, "Output Objective:"
)
if agent_1_output_objective is None:
raise ValueError("Agent 1 output objective not found")
except Exception as e:
raise ValueError(
f"Agent 1 instructions not found or not formatted correctly: {e}"
)
# Extract objects
if agent_1_base_data is None:
# Retrieve chunks for objects
retrieved_docs = research(question, search_tool)[:10]
document_texts_list = []
for doc_num, doc in enumerate(retrieved_docs):
chunk_text = "Document " + str(doc_num) + ":\n" + doc.content
document_texts_list.append(chunk_text)
document_texts = "\n\n".join(document_texts_list)
dc_object_extraction_prompt = DC_OBJECT_NO_BASE_DATA_EXTRACTION_PROMPT.format(
question=question,
task=agent_1_task,
document_text=document_texts,
objects_of_interest=agent_1_output_objective,
)
else:
dc_object_extraction_prompt = DC_OBJECT_WITH_BASE_DATA_EXTRACTION_PROMPT.format(
question=question,
task=agent_1_task,
base_data=agent_1_base_data,
objects_of_interest=agent_1_output_objective,
)
msg = [
HumanMessage(
content=trim_prompt_piece(
config=graph_config.tooling.primary_llm.config,
prompt_piece=dc_object_extraction_prompt,
reserved_str="",
),
)
]
primary_llm = graph_config.tooling.primary_llm
# Grader
try:
llm_response = run_with_timeout(
30,
primary_llm.invoke,
prompt=msg,
timeout_override=30,
max_tokens=300,
)
cleaned_response = (
str(llm_response.content)
.replace("```json\n", "")
.replace("\n```", "")
.replace("\n", "")
)
cleaned_response = cleaned_response.split("OBJECTS:")[1]
object_list = [x.strip() for x in cleaned_response.split(";")]
except Exception as e:
raise ValueError(f"Error in search_objects: {e}")
write_custom_event(
"initial_agent_answer",
AgentAnswerPiece(
answer_piece=" Researching the individual objects for each source type... ",
level=0,
level_question_num=0,
answer_type="agent_level_answer",
),
writer,
)
return SearchSourcesObjectsUpdate(
analysis_objects=object_list,
analysis_sources=document_sources,
log_messages=["Agent 1 Task done"],
)
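The response handling above strips any markdown fencing and then expects the LLM to finish with an `OBJECTS:` line of semicolon-separated names; the parsing reduces to the following (the reply text here is invented for illustration):

```python
raw_reply = "Some reasoning first.\nOBJECTS: Acme Corp; Globex; Initech"
cleaned = raw_reply.replace("\n", "")
object_list = [x.strip() for x in cleaned.split("OBJECTS:")[1].split(";")]
print(object_list)  # ['Acme Corp', 'Globex', 'Initech']
```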

View File

@@ -0,0 +1,185 @@
from datetime import datetime
from datetime import timedelta
from datetime import timezone
from typing import cast
from langchain_core.messages import HumanMessage
from langchain_core.runnables import RunnableConfig
from langgraph.types import StreamWriter
from onyx.agents.agent_search.dc_search_analysis.ops import extract_section
from onyx.agents.agent_search.dc_search_analysis.ops import research
from onyx.agents.agent_search.dc_search_analysis.states import ObjectSourceInput
from onyx.agents.agent_search.dc_search_analysis.states import (
ObjectSourceResearchUpdate,
)
from onyx.agents.agent_search.models import GraphConfig
from onyx.agents.agent_search.shared_graph_utils.agent_prompt_ops import (
trim_prompt_piece,
)
from onyx.prompts.agents.dc_prompts import DC_OBJECT_SOURCE_RESEARCH_PROMPT
from onyx.utils.logger import setup_logger
from onyx.utils.threadpool_concurrency import run_with_timeout
logger = setup_logger()
def research_object_source(
state: ObjectSourceInput,
config: RunnableConfig,
writer: StreamWriter = lambda _: None,
) -> ObjectSourceResearchUpdate:
"""
LangGraph node to start the agentic search process.
"""
datetime.now()
graph_config = cast(GraphConfig, config["metadata"]["config"])
graph_config.inputs.search_request.query
search_tool = graph_config.tooling.search_tool
question = graph_config.inputs.search_request.query
object, document_source = state.object_source_combination
if search_tool is None or graph_config.inputs.search_request.persona is None:
raise ValueError("Search tool and persona must be provided for DivCon search")
try:
instructions = graph_config.inputs.search_request.persona.prompts[
0
].system_prompt
agent_2_instructions = extract_section(
instructions, "Agent Step 2:", "Agent Step 3:"
)
if agent_2_instructions is None:
raise ValueError("Agent 2 instructions not found")
agent_2_task = extract_section(
agent_2_instructions, "Task:", "Independent Research Sources:"
)
if agent_2_task is None:
raise ValueError("Agent 2 task not found")
agent_2_time_cutoff = extract_section(
agent_2_instructions, "Time Cutoff:", "Research Topics:"
)
agent_2_research_topics = extract_section(
agent_2_instructions, "Research Topics:", "Output Objective"
)
agent_2_output_objective = extract_section(
agent_2_instructions, "Output Objective:"
)
if agent_2_output_objective is None:
raise ValueError("Agent 2 output objective not found")
except Exception as e:
raise ValueError(
f"Agent 2 instructions not found or not formatted correctly: {e}"
)
# Populate prompt
# Retrieve chunks for objects
if agent_2_time_cutoff is not None and agent_2_time_cutoff.strip() != "":
if agent_2_time_cutoff.strip().endswith("d"):
try:
days = int(agent_2_time_cutoff.strip()[:-1])
agent_2_source_start_time = datetime.now(timezone.utc) - timedelta(
days=days
)
except ValueError:
raise ValueError(
f"Invalid time cutoff format: {agent_2_time_cutoff}. Expected format: '<number>d'"
)
else:
raise ValueError(
f"Invalid time cutoff format: {agent_2_time_cutoff}. Expected format: '<number>d'"
)
else:
agent_2_source_start_time = None
document_sources = [document_source] if document_source else None
if len(question.strip()) > 0:
research_area = f"{question} for {object}"
elif agent_2_research_topics and len(agent_2_research_topics.strip()) > 0:
research_area = f"{agent_2_research_topics} for {object}"
else:
research_area = object
retrieved_docs = research(
question=research_area,
search_tool=search_tool,
document_sources=document_sources,
time_cutoff=agent_2_source_start_time,
)
# Generate document text
document_texts_list = []
for doc_num, doc in enumerate(retrieved_docs):
chunk_text = "Document " + str(doc_num) + ":\n" + doc.content
document_texts_list.append(chunk_text)
document_texts = "\n\n".join(document_texts_list)
# Built prompt
today = datetime.now().strftime("%A, %Y-%m-%d")
dc_object_source_research_prompt = (
DC_OBJECT_SOURCE_RESEARCH_PROMPT.format(
today=today,
question=question,
task=agent_2_task,
document_text=document_texts,
format=agent_2_output_objective,
)
.replace("---object---", object)
.replace("---source---", document_source.value)
)
# Run LLM
msg = [
HumanMessage(
content=trim_prompt_piece(
config=graph_config.tooling.primary_llm.config,
prompt_piece=dc_object_source_research_prompt,
reserved_str="",
),
)
]
# fast_llm = graph_config.tooling.fast_llm
primary_llm = graph_config.tooling.primary_llm
llm = primary_llm
# Grader
try:
llm_response = run_with_timeout(
30,
llm.invoke,
prompt=msg,
timeout_override=30,
max_tokens=300,
)
cleaned_response = str(llm_response.content).replace("```json\n", "")
cleaned_response = cleaned_response.split("RESEARCH RESULTS:")[1]
object_research_results = {
"object": object,
"source": document_source.value,
"research_result": cleaned_response,
}
except Exception as e:
raise ValueError(f"Error in research_object_source: {e}")
logger.debug("DivCon Step A2 - Object Source Research - completed for an object")
return ObjectSourceResearchUpdate(
object_source_research_results=[object_research_results],
log_messages=["Agent Step 2 done for one object"],
)

View File

@@ -0,0 +1,68 @@
from collections import defaultdict
from datetime import datetime
from typing import cast
from typing import Dict
from typing import List
from langchain_core.runnables import RunnableConfig
from langgraph.types import StreamWriter
from onyx.agents.agent_search.dc_search_analysis.states import MainState
from onyx.agents.agent_search.dc_search_analysis.states import (
ObjectResearchInformationUpdate,
)
from onyx.agents.agent_search.models import GraphConfig
from onyx.agents.agent_search.shared_graph_utils.utils import write_custom_event
from onyx.chat.models import AgentAnswerPiece
from onyx.utils.logger import setup_logger
logger = setup_logger()
def structure_research_by_object(
state: MainState, config: RunnableConfig, writer: StreamWriter = lambda _: None
) -> ObjectResearchInformationUpdate:
"""
LangGraph node to start the agentic search process.
"""
datetime.now()
graph_config = cast(GraphConfig, config["metadata"]["config"])
graph_config.inputs.search_request.query
write_custom_event(
"initial_agent_answer",
AgentAnswerPiece(
answer_piece=" consolidating the information across source types for each object...",
level=0,
level_question_num=0,
answer_type="agent_level_answer",
),
writer,
)
object_source_research_results = state.object_source_research_results
object_research_information_results: List[Dict[str, str]] = []
object_research_information_results_list: Dict[str, List[str]] = defaultdict(list)
for object_source_research in object_source_research_results:
object = object_source_research["object"]
source = object_source_research["source"]
research_result = object_source_research["research_result"]
object_research_information_results_list[object].append(
f"Source: {source}\n{research_result}"
)
for object, information in object_research_information_results_list.items():
object_research_information_results.append(
{"object": object, "information": "\n".join(information)}
)
logger.debug("DivCon Step A3 - Object Research Information Structuring - completed")
return ObjectResearchInformationUpdate(
object_research_information_results=object_research_information_results,
log_messages=["A3 - Object Research Information structured"],
)

View File

@@ -0,0 +1,107 @@
from typing import cast
from langchain_core.messages import HumanMessage
from langchain_core.runnables import RunnableConfig
from langgraph.types import StreamWriter
from onyx.agents.agent_search.dc_search_analysis.ops import extract_section
from onyx.agents.agent_search.dc_search_analysis.states import ObjectInformationInput
from onyx.agents.agent_search.dc_search_analysis.states import ObjectResearchUpdate
from onyx.agents.agent_search.models import GraphConfig
from onyx.agents.agent_search.shared_graph_utils.agent_prompt_ops import (
trim_prompt_piece,
)
from onyx.prompts.agents.dc_prompts import DC_OBJECT_CONSOLIDATION_PROMPT
from onyx.utils.logger import setup_logger
from onyx.utils.threadpool_concurrency import run_with_timeout
logger = setup_logger()
def consolidate_object_research(
state: ObjectInformationInput,
config: RunnableConfig,
writer: StreamWriter = lambda _: None,
) -> ObjectResearchUpdate:
"""
LangGraph node to start the agentic search process.
"""
graph_config = cast(GraphConfig, config["metadata"]["config"])
graph_config.inputs.search_request.query
search_tool = graph_config.tooling.search_tool
question = graph_config.inputs.search_request.query
if search_tool is None or graph_config.inputs.search_request.persona is None:
raise ValueError("Search tool and persona must be provided for DivCon search")
instructions = graph_config.inputs.search_request.persona.prompts[0].system_prompt
agent_4_instructions = extract_section(
instructions, "Agent Step 4:", "Agent Step 5:"
)
if agent_4_instructions is None:
raise ValueError("Agent 4 instructions not found")
agent_4_output_objective = extract_section(
agent_4_instructions, "Output Objective:"
)
if agent_4_output_objective is None:
raise ValueError("Agent 4 output objective not found")
object_information = state.object_information
object = object_information["object"]
information = object_information["information"]
# Create a prompt for the object consolidation
dc_object_consolidation_prompt = DC_OBJECT_CONSOLIDATION_PROMPT.format(
question=question,
object=object,
information=information,
format=agent_4_output_objective,
)
# Run LLM
msg = [
HumanMessage(
content=trim_prompt_piece(
config=graph_config.tooling.primary_llm.config,
prompt_piece=dc_object_consolidation_prompt,
reserved_str="",
),
)
]
graph_config.tooling.primary_llm
# fast_llm = graph_config.tooling.fast_llm
primary_llm = graph_config.tooling.primary_llm
llm = primary_llm
# Grader
try:
llm_response = run_with_timeout(
30,
llm.invoke,
prompt=msg,
timeout_override=30,
max_tokens=300,
)
cleaned_response = str(llm_response.content).replace("```json\n", "")
consolidated_information = cleaned_response.split("INFORMATION:")[1]
except Exception as e:
raise ValueError(f"Error in consolidate_object_research: {e}")
object_research_results = {
"object": object,
"research_result": consolidated_information,
}
logger.debug(
"DivCon Step A4 - Object Research Consolidation - completed for an object"
)
return ObjectResearchUpdate(
object_research_results=[object_research_results],
log_messages=["Agent Source Consilidation done"],
)

View File

@@ -0,0 +1,164 @@
from datetime import datetime
from typing import cast
from langchain_core.messages import HumanMessage
from langchain_core.runnables import RunnableConfig
from langgraph.types import StreamWriter
from onyx.agents.agent_search.dc_search_analysis.ops import extract_section
from onyx.agents.agent_search.dc_search_analysis.states import MainState
from onyx.agents.agent_search.dc_search_analysis.states import ResearchUpdate
from onyx.agents.agent_search.models import GraphConfig
from onyx.agents.agent_search.shared_graph_utils.agent_prompt_ops import (
trim_prompt_piece,
)
from onyx.agents.agent_search.shared_graph_utils.utils import write_custom_event
from onyx.chat.models import AgentAnswerPiece
from onyx.prompts.agents.dc_prompts import DC_FORMATTING_NO_BASE_DATA_PROMPT
from onyx.prompts.agents.dc_prompts import DC_FORMATTING_WITH_BASE_DATA_PROMPT
from onyx.utils.logger import setup_logger
from onyx.utils.threadpool_concurrency import run_with_timeout
logger = setup_logger()
def consolidate_research(
state: MainState, config: RunnableConfig, writer: StreamWriter = lambda _: None
) -> ResearchUpdate:
"""
LangGraph node to start the agentic search process.
"""
datetime.now()
graph_config = cast(GraphConfig, config["metadata"]["config"])
graph_config.inputs.search_request.query
search_tool = graph_config.tooling.search_tool
write_custom_event(
"initial_agent_answer",
AgentAnswerPiece(
answer_piece=" generating the answer\n\n\n",
level=0,
level_question_num=0,
answer_type="agent_level_answer",
),
writer,
)
if search_tool is None or graph_config.inputs.search_request.persona is None:
raise ValueError("Search tool and persona must be provided for DivCon search")
# Populate prompt
instructions = graph_config.inputs.search_request.persona.prompts[0].system_prompt
try:
agent_5_instructions = extract_section(
instructions, "Agent Step 5:", "Agent End"
)
if agent_5_instructions is None:
raise ValueError("Agent 5 instructions not found")
agent_5_base_data = extract_section(instructions, "|Start Data|", "|End Data|")
agent_5_task = extract_section(
agent_5_instructions, "Task:", "Independent Research Sources:"
)
if agent_5_task is None:
raise ValueError("Agent 5 task not found")
agent_5_output_objective = extract_section(
agent_5_instructions, "Output Objective:"
)
if agent_5_output_objective is None:
raise ValueError("Agent 5 output objective not found")
except ValueError as e:
raise ValueError(
f"Instructions for Agent Step 5 were not properly formatted: {e}"
)
research_result_list = []
if agent_5_task.strip() == "*concatenate*":
object_research_results = state.object_research_results
for object_research_result in object_research_results:
object = object_research_result["object"]
research_result = object_research_result["research_result"]
research_result_list.append(f"Object: {object}\n\n{research_result}")
research_results = "\n\n".join(research_result_list)
else:
raise NotImplementedError("Only '*concatenate*' is currently supported")
# Create a prompt for the object consolidation
if agent_5_base_data is None:
dc_formatting_prompt = DC_FORMATTING_NO_BASE_DATA_PROMPT.format(
text=research_results,
format=agent_5_output_objective,
)
else:
dc_formatting_prompt = DC_FORMATTING_WITH_BASE_DATA_PROMPT.format(
base_data=agent_5_base_data,
text=research_results,
format=agent_5_output_objective,
)
# Run LLM
msg = [
HumanMessage(
content=trim_prompt_piece(
config=graph_config.tooling.primary_llm.config,
prompt_piece=dc_formatting_prompt,
reserved_str="",
),
)
]
dispatch_timings: list[float] = []
primary_model = graph_config.tooling.primary_llm
def stream_initial_answer() -> list[str]:
response: list[str] = []
for message in primary_model.stream(msg, timeout_override=30, max_tokens=None):
# TODO: in principle, the answer here COULD contain images, but we don't support that yet
content = message.content
if not isinstance(content, str):
raise ValueError(
f"Expected content to be a string, but got {type(content)}"
)
start_stream_token = datetime.now()
write_custom_event(
"initial_agent_answer",
AgentAnswerPiece(
answer_piece=content,
level=0,
level_question_num=0,
answer_type="agent_level_answer",
),
writer,
)
end_stream_token = datetime.now()
dispatch_timings.append(
(end_stream_token - start_stream_token).microseconds
)
response.append(content)
return response
try:
_ = run_with_timeout(
60,
stream_initial_answer,
)
except Exception as e:
raise ValueError(f"Error in consolidate_research: {e}")
logger.debug("DivCon Step A5 - Final Generation - completed")
return ResearchUpdate(
research_results=research_results,
log_messages=["Agent Source Consilidation done"],
)

View File

@@ -0,0 +1,61 @@
from datetime import datetime
from typing import cast
from onyx.chat.models import LlmDoc
from onyx.configs.constants import DocumentSource
from onyx.context.search.models import InferenceSection
from onyx.db.engine import get_session_with_current_tenant
from onyx.tools.models import SearchToolOverrideKwargs
from onyx.tools.tool_implementations.search.search_tool import (
FINAL_CONTEXT_DOCUMENTS_ID,
)
from onyx.tools.tool_implementations.search.search_tool import SearchTool
def research(
question: str,
search_tool: SearchTool,
document_sources: list[DocumentSource] | None = None,
time_cutoff: datetime | None = None,
) -> list[LlmDoc]:
# new db session to avoid concurrency issues
callback_container: list[list[InferenceSection]] = []
retrieved_docs: list[LlmDoc] = []
with get_session_with_current_tenant() as db_session:
for tool_response in search_tool.run(
query=question,
override_kwargs=SearchToolOverrideKwargs(
force_no_rerank=False,
alternate_db_session=db_session,
retrieved_sections_callback=callback_container.append,
skip_query_analysis=True,
document_sources=document_sources,
time_cutoff=time_cutoff,
),
):
# get retrieved docs to send to the rest of the graph
if tool_response.id == FINAL_CONTEXT_DOCUMENTS_ID:
retrieved_docs = cast(list[LlmDoc], tool_response.response)[:10]
break
return retrieved_docs
def extract_section(
text: str, start_marker: str, end_marker: str | None = None
) -> str | None:
"""Extract text between markers, returning None if markers not found"""
parts = text.split(start_marker)
if len(parts) == 1:
return None
after_start = parts[1].strip()
if not end_marker:
return after_start
extract = after_start.split(end_marker)[0]
return extract.strip()
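`extract_section` is what lets the DivCon nodes pull their per-step instructions out of a single free-form persona prompt. A quick illustration of its behaviour on a made-up prompt (the helper is repeated so the snippet runs standalone):

```python
# extract_section repeated from above so this snippet is self-contained
def extract_section(text: str, start_marker: str, end_marker: str | None = None) -> str | None:
    parts = text.split(start_marker)
    if len(parts) == 1:
        return None
    after_start = parts[1].strip()
    if not end_marker:
        return after_start
    return after_start.split(end_marker)[0].strip()


prompt = (
    "Agent Step 1: Task: find the objects. "
    "Independent Research Sources: slack; web. "
    "Output Objective: a list of objects. "
    "Agent Step 2: ..."
)

step_1 = extract_section(prompt, "Agent Step 1:", "Agent Step 2:")
assert step_1 is not None
print(extract_section(step_1, "Task:", "Independent Research Sources:"))  # 'find the objects.'
print(extract_section(step_1, "Independent Research Sources:", "Output Objective:"))  # 'slack; web.'
print(extract_section(prompt, "Agent Step 7:"))  # None (marker absent)
```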

View File

@@ -0,0 +1,72 @@
from operator import add
from typing import Annotated
from typing import Dict
from typing import TypedDict
from pydantic import BaseModel
from onyx.agents.agent_search.core_state import CoreState
from onyx.agents.agent_search.orchestration.states import ToolCallUpdate
from onyx.agents.agent_search.orchestration.states import ToolChoiceInput
from onyx.agents.agent_search.orchestration.states import ToolChoiceUpdate
from onyx.configs.constants import DocumentSource
### States ###
class LoggerUpdate(BaseModel):
log_messages: Annotated[list[str], add] = []
class SearchSourcesObjectsUpdate(LoggerUpdate):
analysis_objects: list[str] = []
analysis_sources: list[DocumentSource] = []
class ObjectSourceInput(LoggerUpdate):
object_source_combination: tuple[str, DocumentSource]
class ObjectSourceResearchUpdate(LoggerUpdate):
object_source_research_results: Annotated[list[Dict[str, str]], add] = []
class ObjectInformationInput(LoggerUpdate):
object_information: Dict[str, str]
class ObjectResearchInformationUpdate(LoggerUpdate):
object_research_information_results: Annotated[list[Dict[str, str]], add] = []
class ObjectResearchUpdate(LoggerUpdate):
object_research_results: Annotated[list[Dict[str, str]], add] = []
class ResearchUpdate(LoggerUpdate):
research_results: str | None = None
## Graph Input State
class MainInput(CoreState):
pass
## Graph State
class MainState(
# This includes the core state
MainInput,
ToolChoiceInput,
ToolCallUpdate,
ToolChoiceUpdate,
SearchSourcesObjectsUpdate,
ObjectSourceResearchUpdate,
ObjectResearchInformationUpdate,
ObjectResearchUpdate,
ResearchUpdate,
):
pass
## Graph Output State - presently not used
class MainOutput(TypedDict):
log_messages: list[str]

View File

@@ -8,6 +8,10 @@ from langgraph.graph.state import CompiledStateGraph
from onyx.agents.agent_search.basic.graph_builder import basic_graph_builder
from onyx.agents.agent_search.basic.states import BasicInput
from onyx.agents.agent_search.dc_search_analysis.graph_builder import (
divide_and_conquer_graph_builder,
)
from onyx.agents.agent_search.dc_search_analysis.states import MainInput as DCMainInput
from onyx.agents.agent_search.deep_search.main.graph_builder import (
main_graph_builder as main_graph_builder_a,
)
@@ -82,7 +86,7 @@ def _parse_agent_event(
def manage_sync_streaming(
compiled_graph: CompiledStateGraph,
config: GraphConfig,
graph_input: BasicInput | MainInput,
graph_input: BasicInput | MainInput | DCMainInput,
) -> Iterable[StreamEvent]:
message_id = config.persistence.message_id if config.persistence else None
for event in compiled_graph.stream(
@@ -96,7 +100,7 @@ def manage_sync_streaming(
def run_graph(
compiled_graph: CompiledStateGraph,
config: GraphConfig,
input: BasicInput | MainInput,
input: BasicInput | MainInput | DCMainInput,
) -> AnswerStream:
config.behavior.perform_initial_search_decomposition = (
INITIAL_SEARCH_DECOMPOSITION_ENABLED
@@ -146,6 +150,16 @@ def run_basic_graph(
return run_graph(compiled_graph, config, input)
def run_dc_graph(
config: GraphConfig,
) -> AnswerStream:
graph = divide_and_conquer_graph_builder()
compiled_graph = graph.compile()
input = DCMainInput(log_messages=[])
config.inputs.search_request.query = config.inputs.search_request.query.strip()
return run_graph(compiled_graph, config, input)
if __name__ == "__main__":
for _ in range(1):
query_start_time = datetime.now()

View File

@@ -180,3 +180,35 @@ def binary_string_test_after_answer_separator(
relevant_text = text.split(f"{separator}")[-1]
return binary_string_test(relevant_text, positive_value)
def build_dc_search_prompt(
question: str,
original_question: str,
docs: list[InferenceSection],
persona_specification: str,
config: LLMConfig,
) -> list[SystemMessage | HumanMessage | AIMessage | ToolMessage]:
system_message = SystemMessage(
content=persona_specification,
)
date_str = build_date_time_string()
docs_str = format_docs(docs)
docs_str = trim_prompt_piece(
config,
docs_str,
SUB_QUESTION_RAG_PROMPT + question + original_question + date_str,
)
human_message = HumanMessage(
content=SUB_QUESTION_RAG_PROMPT.format(
question=question,
original_question=original_question,
context=docs_str,
date_prompt=date_str,
)
)
return [system_message, human_message]

View File

@@ -1,5 +1,6 @@
import logging
import multiprocessing
import os
import time
from typing import Any
from typing import cast
@@ -305,7 +306,7 @@ def wait_for_db(sender: Any, **kwargs: Any) -> None:
def on_secondary_worker_init(sender: Any, **kwargs: Any) -> None:
logger.info("Running as a secondary celery worker.")
logger.info(f"Running as a secondary celery worker: pid={os.getpid()}")
# Set up variables for waiting on primary worker
WAIT_INTERVAL = 5

View File

@@ -0,0 +1,7 @@
from celery import Celery
import onyx.background.celery.apps.app_base as app_base
celery_app = Celery(__name__)
celery_app.config_from_object("onyx.background.celery.configs.client")
celery_app.Task = app_base.TenantAwareTask # type: ignore [misc]

View File

@@ -1,4 +1,5 @@
import logging
import os
from typing import Any
from typing import cast
@@ -95,7 +96,7 @@ def on_worker_init(sender: Worker, **kwargs: Any) -> None:
app_base.wait_for_db(sender, **kwargs)
app_base.wait_for_vespa_or_shutdown(sender, **kwargs)
logger.info("Running as the primary celery worker.")
logger.info(f"Running as the primary celery worker: pid={os.getpid()}")
# Less startup checks in multi-tenant case
if MULTI_TENANT:

View File

@@ -0,0 +1,16 @@
import onyx.background.celery.configs.base as shared_config
broker_url = shared_config.broker_url
broker_connection_retry_on_startup = shared_config.broker_connection_retry_on_startup
broker_pool_limit = shared_config.broker_pool_limit
broker_transport_options = shared_config.broker_transport_options
redis_socket_keepalive = shared_config.redis_socket_keepalive
redis_retry_on_timeout = shared_config.redis_retry_on_timeout
redis_backend_health_check_interval = shared_config.redis_backend_health_check_interval
result_backend = shared_config.result_backend
result_expires = shared_config.result_expires # 86400 seconds is the default
task_default_priority = shared_config.task_default_priority
task_acks_late = shared_config.task_acks_late

View File

@@ -0,0 +1,20 @@
"""Factory stub for running celery worker / celery beat.
This code is different from the primary/beat stubs because there is no EE version to
fetch. Port over the code in those files if we add an EE version of this worker.
This is an app stub purely for sending tasks as a client.
"""
from celery import Celery
from onyx.utils.variable_functionality import set_is_ee_based_on_env_variable
set_is_ee_based_on_env_variable()
def get_app() -> Celery:
from onyx.background.celery.apps.client import celery_app
return celery_app
app = get_app()
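For context on the client-only app stub above: such an app never starts a worker, it only enqueues tasks by registered name. The sketch below is an illustration under assumptions; the broker URL, task name, and queue are placeholders, not values from this repo.

from celery import Celery

client = Celery(__name__, broker="redis://localhost:6379/0")  # placeholder broker

# send_task enqueues by name without importing the task's module
result = client.send_task("onyx.example_task", args=[42], queue="default")
print(result.id)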

View File

@@ -10,6 +10,7 @@ from onyx.agents.agent_search.models import GraphPersistence
from onyx.agents.agent_search.models import GraphSearchConfig
from onyx.agents.agent_search.models import GraphTooling
from onyx.agents.agent_search.run_graph import run_basic_graph
from onyx.agents.agent_search.run_graph import run_dc_graph
from onyx.agents.agent_search.run_graph import run_main_graph
from onyx.chat.models import AgentAnswerPiece
from onyx.chat.models import AnswerPacket
@@ -142,11 +143,18 @@ class Answer:
yield from self._processed_stream
return
run_langgraph = (
run_main_graph
if self.graph_config.behavior.use_agentic_search
else run_basic_graph
)
if self.graph_config.behavior.use_agentic_search:
run_langgraph = run_main_graph
elif (
self.graph_config.inputs.search_request.persona
and self.graph_config.inputs.search_request.persona.description.startswith(
"DivCon Beta Agent"
)
):
run_langgraph = run_dc_graph
else:
run_langgraph = run_basic_graph
stream = run_langgraph(
self.graph_config,
)

View File

@@ -43,6 +43,7 @@ from onyx.chat.prompt_builder.answer_prompt_builder import default_build_user_me
from onyx.configs.chat_configs import CHAT_TARGET_CHUNK_PERCENTAGE
from onyx.configs.chat_configs import DISABLE_LLM_CHOOSE_SEARCH
from onyx.configs.chat_configs import MAX_CHUNKS_FED_TO_CHAT
from onyx.configs.chat_configs import SELECTED_SECTIONS_MAX_WINDOW_PERCENTAGE
from onyx.configs.constants import AGENT_SEARCH_INITIAL_KEY
from onyx.configs.constants import BASIC_KEY
from onyx.configs.constants import MessageType
@@ -692,8 +693,13 @@ def stream_chat_message_objects(
doc_identifiers=identifier_tuples,
document_index=document_index,
)
# Add a maximum context size in the case of user-selected docs to prevent
# slight inaccuracies in context window size pruning from causing
# the entire query to fail
document_pruning_config = DocumentPruningConfig(
is_manually_selected_docs=True
is_manually_selected_docs=True,
max_window_percentage=SELECTED_SECTIONS_MAX_WINDOW_PERCENTAGE,
)
# In case the search doc is deleted, just don't include it
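To make the comment above concrete, the cap translates into a token budget proportional to the model's context window. A rough, illustrative calculation (the 128k context size is an assumption; the 0.8 comes from the chat_configs hunk further down):

SELECTED_SECTIONS_MAX_WINDOW_PERCENTAGE = 0.8
model_context_tokens = 128_000  # assumed context window size
max_selected_section_tokens = int(
    model_context_tokens * SELECTED_SECTIONS_MAX_WINDOW_PERCENTAGE
)
print(max_selected_section_tokens)  # 102400 tokens available for user-selected docs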

View File

@@ -312,11 +312,14 @@ def prune_sections(
)
def _merge_doc_chunks(chunks: list[InferenceChunk]) -> InferenceSection:
def _merge_doc_chunks(chunks: list[InferenceChunk]) -> tuple[InferenceSection, int]:
assert (
len(set([chunk.document_id for chunk in chunks])) == 1
), "One distinct document must be passed into merge_doc_chunks"
ADJACENT_CHUNK_SEP = "\n"
DISTANT_CHUNK_SEP = "\n\n...\n\n"
# Assuming there are no duplicates by this point
sorted_chunks = sorted(chunks, key=lambda x: x.chunk_id)
@@ -324,33 +327,48 @@ def _merge_doc_chunks(chunks: list[InferenceChunk]) -> InferenceSection:
chunks, key=lambda x: x.score if x.score is not None else float("-inf")
)
added_chars = 0
merged_content = []
for i, chunk in enumerate(sorted_chunks):
if i > 0:
prev_chunk_id = sorted_chunks[i - 1].chunk_id
if chunk.chunk_id == prev_chunk_id + 1:
merged_content.append("\n")
else:
merged_content.append("\n\n...\n\n")
sep = (
ADJACENT_CHUNK_SEP
if chunk.chunk_id == prev_chunk_id + 1
else DISTANT_CHUNK_SEP
)
merged_content.append(sep)
added_chars += len(sep)
merged_content.append(chunk.content)
combined_content = "".join(merged_content)
return InferenceSection(
center_chunk=center_chunk,
chunks=sorted_chunks,
combined_content=combined_content,
return (
InferenceSection(
center_chunk=center_chunk,
chunks=sorted_chunks,
combined_content=combined_content,
),
added_chars,
)
def _merge_sections(sections: list[InferenceSection]) -> list[InferenceSection]:
docs_map: dict[str, dict[int, InferenceChunk]] = defaultdict(dict)
doc_order: dict[str, int] = {}
combined_section_lengths: dict[str, int] = defaultdict(lambda: 0)
# chunk de-duping and doc ordering
for index, section in enumerate(sections):
if section.center_chunk.document_id not in doc_order:
doc_order[section.center_chunk.document_id] = index
combined_section_lengths[section.center_chunk.document_id] += len(
section.combined_content
)
chunks_map = docs_map[section.center_chunk.document_id]
for chunk in [section.center_chunk] + section.chunks:
chunks_map = docs_map[section.center_chunk.document_id]
existing_chunk = chunks_map.get(chunk.chunk_id)
if (
existing_chunk is None
@@ -361,8 +379,22 @@ def _merge_sections(sections: list[InferenceSection]) -> list[InferenceSection]:
chunks_map[chunk.chunk_id] = chunk
new_sections = []
for section_chunks in docs_map.values():
new_sections.append(_merge_doc_chunks(chunks=list(section_chunks.values())))
for doc_id, section_chunks in docs_map.items():
section_chunks_list = list(section_chunks.values())
merged_section, added_chars = _merge_doc_chunks(chunks=section_chunks_list)
previous_length = combined_section_lengths[doc_id] + added_chars
# After merging, ensure the content respects the pruning done earlier. Each
# combined section is restricted to the sum of the lengths of the sections
# from the pruning step. Technically the correct approach would be to prune based
# on tokens AGAIN, but this is a good approximation and worth not adding the
# tokenization overhead. This could also be fixed if we added a way of removing
# chunks from sections in the pruning step; at the moment this issue largely
# exists because we only trim the final section's combined_content.
merged_section.combined_content = merged_section.combined_content[
:previous_length
]
new_sections.append(merged_section)
# Sort by highest score, then by original document order
# It is now 1 large section per doc, the center chunk being the one with the highest score
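A condensed, self-contained sketch of the merge-and-trim idea in the two functions above (not Onyx code; chunks are simplified to (chunk_id, content) pairs): adjacent chunks join with a newline, gaps join with an ellipsis separator, the separator characters are counted, and the merged text can then be clipped back to the per-document budget carried over from pruning plus those separators.

ADJACENT_CHUNK_SEP = "\n"
DISTANT_CHUNK_SEP = "\n\n...\n\n"


def merge_chunks(chunks: list[tuple[int, str]]) -> tuple[str, int]:
    """Return (merged_text, added_chars) for (chunk_id, content) pairs."""
    chunks = sorted(chunks, key=lambda c: c[0])
    parts: list[str] = []
    added_chars = 0
    for i, (chunk_id, content) in enumerate(chunks):
        if i > 0:
            prev_id = chunks[i - 1][0]
            sep = ADJACENT_CHUNK_SEP if chunk_id == prev_id + 1 else DISTANT_CHUNK_SEP
            parts.append(sep)
            added_chars += len(sep)
        parts.append(content)
    return "".join(parts), added_chars


merged, added = merge_chunks([(1, "alpha"), (2, "beta"), (5, "gamma")])
pruned_budget = 14  # sum of the section lengths kept by the pruning step
# the clip only bites when pruning shortened a section's combined_content
merged = merged[: pruned_budget + added]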

View File

@@ -16,6 +16,9 @@ MAX_CHUNKS_FED_TO_CHAT = float(os.environ.get("MAX_CHUNKS_FED_TO_CHAT") or 10.0)
# ~3k input, half for docs, half for chat history + prompts
CHAT_TARGET_CHUNK_PERCENTAGE = 512 * 3 / 3072
# Maximum percentage of the context window to fill with selected sections
SELECTED_SECTIONS_MAX_WINDOW_PERCENTAGE = 0.8
# 1 / (1 + DOC_TIME_DECAY * doc-age-in-years), set to 0 to have no decay
# Capped in Vespa at 0.5
DOC_TIME_DECAY = float(

View File

@@ -13,6 +13,7 @@ from typing import TYPE_CHECKING
from typing import TypeVar
from urllib.parse import parse_qs
from urllib.parse import quote
from urllib.parse import urljoin
from urllib.parse import urlparse
import requests
@@ -342,9 +343,14 @@ def build_confluence_document_id(
Returns:
str: The document id
"""
if is_cloud and not base_url.endswith("/wiki"):
base_url += "/wiki"
return f"{base_url}{content_url}"
# NOTE: urljoin is tricky and will drop the last segment of the base if it doesn't
# end with "/" because it believes that makes it a file.
final_url = base_url.rstrip("/") + "/"
if is_cloud and not final_url.endswith("/wiki/"):
final_url = urljoin(final_url, "wiki") + "/"
final_url = urljoin(final_url, content_url.lstrip("/"))
return final_url
def datetime_from_string(datetime_string: str) -> datetime:
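A quick standalone illustration of the urljoin pitfall called out in the NOTE above (the domain is made up): without a trailing slash, the last path segment of the base is treated as a file name and dropped.

from urllib.parse import urljoin

print(urljoin("https://example.atlassian.net/wiki", "spaces/ENG/pages/1"))
# -> https://example.atlassian.net/spaces/ENG/pages/1   ("wiki" is lost)
print(urljoin("https://example.atlassian.net/wiki/", "spaces/ENG/pages/1"))
# -> https://example.atlassian.net/wiki/spaces/ENG/pages/1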
@@ -454,6 +460,19 @@ def _handle_http_error(e: requests.HTTPError, attempt: int) -> int:
logger.warning("HTTPError with `None` as response or as headers")
raise e
# Confluence Server returns 403 when rate limited
if e.response.status_code == 403:
FORBIDDEN_MAX_RETRY_ATTEMPTS = 7
FORBIDDEN_RETRY_DELAY = 10
if attempt < FORBIDDEN_MAX_RETRY_ATTEMPTS:
logger.warning(
"403 error. This sometimes happens when we hit "
f"Confluence rate limits. Retrying in {FORBIDDEN_RETRY_DELAY} seconds..."
)
return FORBIDDEN_RETRY_DELAY
raise e
if (
e.response.status_code != 429
and RATE_LIMIT_MESSAGE_LOWERCASE not in e.response.text.lower()
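The 403 branch above plugs into a surrounding retry helper that either sleeps for the returned delay or re-raises. The sketch below is a generic approximation of that pattern, not the actual helper in this file; the timeout and attempt counts are illustrative.

import time

import requests


def handle_http_error(e: requests.HTTPError, attempt: int) -> int:
    # Confluence Server sometimes signals rate limiting with 403
    if e.response is not None and e.response.status_code == 403 and attempt < 7:
        return 10  # seconds to wait before the next attempt
    raise e


def get_with_retries(url: str, max_attempts: int = 10) -> requests.Response:
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            return resp
        except requests.HTTPError as e:
            time.sleep(handle_http_error(e, attempt))
    raise RuntimeError("retries exhausted")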

View File

@@ -1,4 +1,5 @@
import base64
import time
from collections.abc import Generator
from datetime import datetime
from datetime import timedelta
@@ -7,6 +8,8 @@ from typing import Any
from typing import cast
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry
from onyx.configs.app_configs import CONTINUE_ON_CONNECTOR_FAILURE
from onyx.configs.app_configs import GONG_CONNECTOR_START_TIME
@@ -21,13 +24,14 @@ from onyx.connectors.models import Document
from onyx.connectors.models import TextSection
from onyx.utils.logger import setup_logger
logger = setup_logger()
GONG_BASE_URL = "https://us-34014.api.gong.io"
class GongConnector(LoadConnector, PollConnector):
BASE_URL = "https://api.gong.io"
MAX_CALL_DETAILS_ATTEMPTS = 6
CALL_DETAILS_DELAY = 30 # in seconds
def __init__(
self,
workspaces: list[str] | None = None,
@@ -41,15 +45,23 @@ class GongConnector(LoadConnector, PollConnector):
self.auth_token_basic: str | None = None
self.hide_user_info = hide_user_info
def _get_auth_header(self) -> dict[str, str]:
if self.auth_token_basic is None:
raise ConnectorMissingCredentialError("Gong")
retry_strategy = Retry(
total=5,
backoff_factor=2,
status_forcelist=[429, 500, 502, 503, 504],
)
return {"Authorization": f"Basic {self.auth_token_basic}"}
session = requests.Session()
session.mount(GongConnector.BASE_URL, HTTPAdapter(max_retries=retry_strategy))
self._session = session
@staticmethod
def make_url(endpoint: str) -> str:
url = f"{GongConnector.BASE_URL}{endpoint}"
return url
def _get_workspace_id_map(self) -> dict[str, str]:
url = f"{GONG_BASE_URL}/v2/workspaces"
response = requests.get(url, headers=self._get_auth_header())
response = self._session.get(GongConnector.make_url("/v2/workspaces"))
response.raise_for_status()
workspaces_details = response.json().get("workspaces")
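The session setup above relies on the standard requests/urllib3 retry machinery. A minimal standalone version of the same configuration (the auth header value is a placeholder):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

retry_strategy = Retry(
    total=5,
    backoff_factor=2,  # exponential backoff between retried attempts
    status_forcelist=[429, 500, 502, 503, 504],
)
session = requests.Session()
session.mount("https://api.gong.io", HTTPAdapter(max_retries=retry_strategy))
session.headers.update({"Authorization": "Basic <base64 of key:secret>"})  # placeholder

# every request made through this session now retries transient failures
# response = session.get("https://api.gong.io/v2/workspaces")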
@@ -66,7 +78,6 @@ class GongConnector(LoadConnector, PollConnector):
def _get_transcript_batches(
self, start_datetime: str | None = None, end_datetime: str | None = None
) -> Generator[list[dict[str, Any]], None, None]:
url = f"{GONG_BASE_URL}/v2/calls/transcript"
body: dict[str, dict] = {"filter": {}}
if start_datetime:
body["filter"]["fromDateTime"] = start_datetime
@@ -94,8 +105,8 @@ class GongConnector(LoadConnector, PollConnector):
del body["filter"]["workspaceId"]
while True:
response = requests.post(
url, headers=self._get_auth_header(), json=body
response = self._session.post(
GongConnector.make_url("/v2/calls/transcript"), json=body
)
# If no calls in the range, just break out
if response.status_code == 404:
@@ -125,14 +136,14 @@ class GongConnector(LoadConnector, PollConnector):
yield transcripts
def _get_call_details_by_ids(self, call_ids: list[str]) -> dict:
url = f"{GONG_BASE_URL}/v2/calls/extensive"
body = {
"filter": {"callIds": call_ids},
"contentSelector": {"exposedFields": {"parties": True}},
}
response = requests.post(url, headers=self._get_auth_header(), json=body)
response = self._session.post(
GongConnector.make_url("/v2/calls/extensive"), json=body
)
response.raise_for_status()
calls = response.json().get("calls")
@@ -165,24 +176,74 @@ class GongConnector(LoadConnector, PollConnector):
def _fetch_calls(
self, start_datetime: str | None = None, end_datetime: str | None = None
) -> GenerateDocumentsOutput:
num_calls = 0
for transcript_batch in self._get_transcript_batches(
start_datetime, end_datetime
):
doc_batch: list[Document] = []
call_ids = cast(
transcript_call_ids = cast(
list[str],
[t.get("callId") for t in transcript_batch if t.get("callId")],
)
call_details_map = self._get_call_details_by_ids(call_ids)
call_details_map: dict[str, Any] = {}
# There's a likely race condition in the API where a transcript will have a
# call ID but the call to v2/calls/extensive will not return all of the IDs.
# Retrying with exponential backoff has been observed to mitigate this
# within ~2 minutes.
current_attempt = 0
while True:
current_attempt += 1
call_details_map = self._get_call_details_by_ids(transcript_call_ids)
if set(transcript_call_ids) == set(call_details_map.keys()):
# we got all the IDs we were expecting ... break and continue
break
# we are missing some IDs. Log and retry with exponential backoff.
missing_call_ids = set(transcript_call_ids) - set(
call_details_map.keys()
)
logger.warning(
f"_get_call_details_by_ids is missing call id's: "
f"current_attempt={current_attempt} "
f"missing_call_ids={missing_call_ids}"
)
if current_attempt >= self.MAX_CALL_DETAILS_ATTEMPTS:
raise RuntimeError(
f"Attempt count exceeded for _get_call_details_by_ids: "
f"missing_call_ids={missing_call_ids} "
f"max_attempts={self.MAX_CALL_DETAILS_ATTEMPTS}"
)
wait_seconds = self.CALL_DETAILS_DELAY * pow(2, current_attempt - 1)
logger.warning(
f"_get_call_details_by_ids waiting to retry: "
f"wait={wait_seconds}s "
f"current_attempt={current_attempt} "
f"next_attempt={current_attempt+1} "
f"max_attempts={self.MAX_CALL_DETAILS_ATTEMPTS}"
)
time.sleep(wait_seconds)
# now we can iterate per call/transcript
for transcript in transcript_batch:
call_id = transcript.get("callId")
if not call_id or call_id not in call_details_map:
# NOTE(rkuo): seeing odd behavior where call_ids from the transcript
# don't have call details. Adding error debugging logs to trace.
logger.error(
f"Couldn't get call information for Call ID: {call_id}"
)
if call_id:
logger.error(
f"Call debug info: call_id={call_id} "
f"call_ids={transcript_call_ids} "
f"call_details_map={call_details_map.keys()}"
)
if not self.continue_on_fail:
raise RuntimeError(
f"Couldn't get call information for Call ID: {call_id}"
@@ -195,7 +256,8 @@ class GongConnector(LoadConnector, PollConnector):
call_time_str = call_metadata["started"]
call_title = call_metadata["title"]
logger.info(
f"Indexing Gong call from {call_time_str.split('T', 1)[0]}: {call_title}"
f"{num_calls+1}: Indexing Gong call id {call_id} "
f"from {call_time_str.split('T', 1)[0]}: {call_title}"
)
call_parties = cast(list[dict] | None, call_details.get("parties"))
@@ -254,8 +316,13 @@ class GongConnector(LoadConnector, PollConnector):
metadata={"client": call_metadata.get("system")},
)
)
num_calls += 1
yield doc_batch
logger.info(f"_fetch_calls finished: num_calls={num_calls}")
def load_credentials(self, credentials: dict[str, Any]) -> dict[str, Any] | None:
combined = (
f'{credentials["gong_access_key"]}:{credentials["gong_access_key_secret"]}'
@@ -263,6 +330,13 @@ class GongConnector(LoadConnector, PollConnector):
self.auth_token_basic = base64.b64encode(combined.encode("utf-8")).decode(
"utf-8"
)
if self.auth_token_basic is None:
raise ConnectorMissingCredentialError("Gong")
self._session.headers.update(
{"Authorization": f"Basic {self.auth_token_basic}"}
)
return None
def load_from_state(self) -> GenerateDocumentsOutput:
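For reference on the retry loop added in _fetch_calls above, the wait schedule it produces with CALL_DETAILS_DELAY = 30 and a cap of 6 attempts looks like this (standalone arithmetic, not connector code):

CALL_DETAILS_DELAY = 30
MAX_CALL_DETAILS_ATTEMPTS = 6

waits = [
    CALL_DETAILS_DELAY * 2 ** (attempt - 1)
    for attempt in range(1, MAX_CALL_DETAILS_ATTEMPTS)
]
print(waits)  # [30, 60, 120, 240, 480] seconds before attempts 2..6
print([sum(waits[:i]) for i in range(1, len(waits) + 1)])  # [30, 90, 210, 450, 930]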

View File

@@ -445,6 +445,9 @@ class GoogleDriveConnector(SlimConnector, CheckpointConnector[GoogleDriveCheckpo
logger.warning(
f"User '{user_email}' does not have access to the drive APIs."
)
# mark this user as done so we don't try to retrieve anything for them
# again
curr_stage.stage = DriveRetrievalStage.DONE
return
raise
@@ -581,6 +584,25 @@ class GoogleDriveConnector(SlimConnector, CheckpointConnector[GoogleDriveCheckpo
drive_ids_to_retrieve, checkpoint
)
# only process emails that we haven't already completed retrieval for
non_completed_org_emails = [
user_email
for user_email, stage in checkpoint.completion_map.items()
if stage != DriveRetrievalStage.DONE
]
# don't process too many emails before returning a checkpoint. This is
# to resolve the case where there are a ton of emails that don't have access
# to the drive APIs. Without this, we could loop through these emails for
# more than 3 hours, causing a timeout and stalling progress.
email_batch_takes_us_to_completion = True
MAX_EMAILS_TO_PROCESS_BEFORE_CHECKPOINTING = 50
if len(non_completed_org_emails) > MAX_EMAILS_TO_PROCESS_BEFORE_CHECKPOINTING:
non_completed_org_emails = non_completed_org_emails[
:MAX_EMAILS_TO_PROCESS_BEFORE_CHECKPOINTING
]
email_batch_takes_us_to_completion = False
user_retrieval_gens = [
self._impersonate_user_for_retrieval(
email,
@@ -591,10 +613,14 @@ class GoogleDriveConnector(SlimConnector, CheckpointConnector[GoogleDriveCheckpo
start,
end,
)
for email in all_org_emails
for email in non_completed_org_emails
]
yield from parallel_yield(user_retrieval_gens, max_workers=MAX_DRIVE_WORKERS)
# if there are more emails to process, don't mark as complete
if not email_batch_takes_us_to_completion:
return
remaining_folders = (
drive_ids_to_retrieve | folder_ids_to_retrieve
) - self._retrieved_ids
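The checkpoint batching above boils down to slicing the pending user list and recording whether that slice finishes the remaining work. A tiny standalone sketch of that decision (the emails are fabricated):

MAX_EMAILS_TO_PROCESS_BEFORE_CHECKPOINTING = 50


def next_email_batch(non_completed: list[str]) -> tuple[list[str], bool]:
    """Return (batch, batch_takes_us_to_completion)."""
    if len(non_completed) > MAX_EMAILS_TO_PROCESS_BEFORE_CHECKPOINTING:
        return non_completed[:MAX_EMAILS_TO_PROCESS_BEFORE_CHECKPOINTING], False
    return non_completed, True


batch, done_after_batch = next_email_batch([f"user{i}@example.com" for i in range(120)])
print(len(batch), done_after_batch)  # 50 False -> checkpoint now, resume later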

View File

@@ -20,7 +20,8 @@ from onyx.connectors.models import ConnectorMissingCredentialError
from onyx.connectors.models import Document
from onyx.connectors.models import SlimDocument
from onyx.connectors.models import TextSection
from onyx.file_processing.extract_file_text import ALL_ACCEPTED_FILE_EXTENSIONS
from onyx.file_processing.extract_file_text import ACCEPTED_DOCUMENT_FILE_EXTENSIONS
from onyx.file_processing.extract_file_text import ACCEPTED_PLAIN_TEXT_FILE_EXTENSIONS
from onyx.file_processing.extract_file_text import extract_file_text
from onyx.indexing.indexing_heartbeat import IndexingHeartbeatInterface
from onyx.utils.logger import setup_logger
@@ -84,14 +85,21 @@ class HighspotConnector(LoadConnector, PollConnector, SlimConnector):
Populate the spot ID map with all available spots.
Keys are stored as lowercase for case-insensitive lookups.
"""
spots = self.client.get_spots()
for spot in spots:
if "title" in spot and "id" in spot:
spot_name = spot["title"]
self._spot_id_map[spot_name.lower()] = spot["id"]
try:
spots = self.client.get_spots()
for spot in spots:
if "title" in spot and "id" in spot:
spot_name = spot["title"]
self._spot_id_map[spot_name.lower()] = spot["id"]
self._all_spots_fetched = True
logger.info(f"Retrieved {len(self._spot_id_map)} spots from Highspot")
self._all_spots_fetched = True
logger.info(f"Retrieved {len(self._spot_id_map)} spots from Highspot")
except HighspotClientError as e:
logger.error(f"Error retrieving spots from Highspot: {str(e)}")
raise
except Exception as e:
logger.error(f"Unexpected error retrieving spots from Highspot: {str(e)}")
raise
def _get_all_spot_names(self) -> List[str]:
"""
@@ -151,116 +159,142 @@ class HighspotConnector(LoadConnector, PollConnector, SlimConnector):
Batches of Document objects
"""
doc_batch: list[Document] = []
try:
# If no spots specified, get all spots
spot_names_to_process = self.spot_names
if not spot_names_to_process:
spot_names_to_process = self._get_all_spot_names()
if not spot_names_to_process:
logger.warning("No spots found in Highspot")
raise ValueError("No spots found in Highspot")
logger.info(
f"No spots specified, using all {len(spot_names_to_process)} available spots"
)
# If no spots specified, get all spots
spot_names_to_process = self.spot_names
if not spot_names_to_process:
spot_names_to_process = self._get_all_spot_names()
logger.info(
f"No spots specified, using all {len(spot_names_to_process)} available spots"
)
for spot_name in spot_names_to_process:
try:
spot_id = self._get_spot_id_from_name(spot_name)
if spot_id is None:
logger.warning(f"Spot ID not found for spot {spot_name}")
continue
offset = 0
has_more = True
while has_more:
logger.info(
f"Retrieving items from spot {spot_name}, offset {offset}"
)
response = self.client.get_spot_items(
spot_id=spot_id, offset=offset, page_size=self.batch_size
)
items = response.get("collection", [])
logger.info(f"Received Items: {items}")
if not items:
has_more = False
for spot_name in spot_names_to_process:
try:
spot_id = self._get_spot_id_from_name(spot_name)
if spot_id is None:
logger.warning(f"Spot ID not found for spot {spot_name}")
continue
offset = 0
has_more = True
for item in items:
try:
item_id = item.get("id")
if not item_id:
logger.warning("Item without ID found, skipping")
continue
while has_more:
logger.info(
f"Retrieving items from spot {spot_name}, offset {offset}"
)
response = self.client.get_spot_items(
spot_id=spot_id, offset=offset, page_size=self.batch_size
)
items = response.get("collection", [])
logger.info(f"Received Items: {items}")
if not items:
has_more = False
continue
item_details = self.client.get_item(item_id)
if not item_details:
logger.warning(
f"Item {item_id} details not found, skipping"
)
continue
# Apply time filter if specified
if start or end:
updated_at = item_details.get("date_updated")
if updated_at:
# Convert to datetime for comparison
try:
updated_time = datetime.fromisoformat(
updated_at.replace("Z", "+00:00")
)
if (
start and updated_time.timestamp() < start
) or (end and updated_time.timestamp() > end):
for item in items:
try:
item_id = item.get("id")
if not item_id:
logger.warning("Item without ID found, skipping")
continue
item_details = self.client.get_item(item_id)
if not item_details:
logger.warning(
f"Item {item_id} details not found, skipping"
)
continue
# Apply time filter if specified
if start or end:
updated_at = item_details.get("date_updated")
if updated_at:
# Convert to datetime for comparison
try:
updated_time = datetime.fromisoformat(
updated_at.replace("Z", "+00:00")
)
if (
start
and updated_time.timestamp() < start
) or (
end and updated_time.timestamp() > end
):
continue
except (ValueError, TypeError):
# Skip if date cannot be parsed
logger.warning(
f"Invalid date format for item {item_id}: {updated_at}"
)
continue
except (ValueError, TypeError):
# Skip if date cannot be parsed
logger.warning(
f"Invalid date format for item {item_id}: {updated_at}"
)
continue
content = self._get_item_content(item_details)
title = item_details.get("title", "")
content = self._get_item_content(item_details)
doc_batch.append(
Document(
id=f"HIGHSPOT_{item_id}",
sections=[
TextSection(
link=item_details.get(
"url",
f"https://www.highspot.com/items/{item_id}",
title = item_details.get("title", "")
doc_batch.append(
Document(
id=f"HIGHSPOT_{item_id}",
sections=[
TextSection(
link=item_details.get(
"url",
f"https://www.highspot.com/items/{item_id}",
),
text=content,
)
],
source=DocumentSource.HIGHSPOT,
semantic_identifier=title,
metadata={
"spot_name": spot_name,
"type": item_details.get(
"content_type", ""
),
text=content,
)
],
source=DocumentSource.HIGHSPOT,
semantic_identifier=title,
metadata={
"spot_name": spot_name,
"type": item_details.get("content_type", ""),
"created_at": item_details.get(
"date_added", ""
),
"author": item_details.get("author", ""),
"language": item_details.get("language", ""),
"can_download": str(
item_details.get("can_download", False)
),
},
doc_updated_at=item_details.get("date_updated"),
"created_at": item_details.get(
"date_added", ""
),
"author": item_details.get("author", ""),
"language": item_details.get(
"language", ""
),
"can_download": str(
item_details.get("can_download", False)
),
},
doc_updated_at=item_details.get("date_updated"),
)
)
)
if len(doc_batch) >= self.batch_size:
yield doc_batch
doc_batch = []
if len(doc_batch) >= self.batch_size:
yield doc_batch
doc_batch = []
except HighspotClientError as e:
item_id = "ID" if not item_id else item_id
logger.error(f"Error retrieving item {item_id}: {str(e)}")
except HighspotClientError as e:
item_id = "ID" if not item_id else item_id
logger.error(
f"Error retrieving item {item_id}: {str(e)}"
)
except Exception as e:
item_id = "ID" if not item_id else item_id
logger.error(
f"Unexpected error for item {item_id}: {str(e)}"
)
has_more = len(items) >= self.batch_size
offset += self.batch_size
has_more = len(items) >= self.batch_size
offset += self.batch_size
except (HighspotClientError, ValueError) as e:
logger.error(f"Error processing spot {spot_name}: {str(e)}")
except (HighspotClientError, ValueError) as e:
logger.error(f"Error processing spot {spot_name}: {str(e)}")
except Exception as e:
logger.error(
f"Unexpected error processing spot {spot_name}: {str(e)}"
)
except Exception as e:
logger.error(f"Error in Highspot connector: {str(e)}")
raise
if doc_batch:
yield doc_batch
@@ -286,7 +320,9 @@ class HighspotConnector(LoadConnector, PollConnector, SlimConnector):
# Extract title and description once at the beginning
title, description = self._extract_title_and_description(item_details)
default_content = f"{title}\n{description}"
logger.info(f"Processing item {item_id} with extension {file_extension}")
logger.info(
f"Processing item {item_id} with extension {file_extension} and file name {content_name}"
)
try:
if content_type == "WebLink":
@@ -298,30 +334,39 @@ class HighspotConnector(LoadConnector, PollConnector, SlimConnector):
elif (
is_valid_format
and file_extension in ALL_ACCEPTED_FILE_EXTENSIONS
and (
file_extension in ACCEPTED_PLAIN_TEXT_FILE_EXTENSIONS
or file_extension in ACCEPTED_DOCUMENT_FILE_EXTENSIONS
)
and can_download
):
# For documents, try to get the text content
if not item_id: # Ensure item_id is defined
return default_content
content_response = self.client.get_item_content(item_id)
# Process and extract text from binary content based on type
if content_response:
text_content = extract_file_text(
BytesIO(content_response), content_name
BytesIO(content_response), content_name, False
)
return text_content
return text_content if text_content else default_content
return default_content
else:
return default_content
except HighspotClientError as e:
# Use item_id safely in the warning message
error_context = f"item {item_id}" if item_id else "item"
error_context = f"item {item_id}" if item_id else "(item id not found)"
logger.warning(f"Could not retrieve content for {error_context}: {str(e)}")
return ""
return default_content
except ValueError as e:
error_context = f"item {item_id}" if item_id else "(item id not found)"
logger.error(f"Value error for {error_context}: {str(e)}")
return default_content
except Exception as e:
error_context = f"item {item_id}" if item_id else "(item id not found)"
logger.error(
f"Unexpected error retrieving content for {error_context}: {str(e)}"
)
return default_content
def _extract_title_and_description(
self, item_details: Dict[str, Any]
@@ -358,55 +403,63 @@ class HighspotConnector(LoadConnector, PollConnector, SlimConnector):
Batches of SlimDocument objects
"""
slim_doc_batch: list[SlimDocument] = []
# If no spots specified, get all spots
spot_names_to_process = self.spot_names
if not spot_names_to_process:
spot_names_to_process = self._get_all_spot_names()
logger.info(
f"No spots specified, using all {len(spot_names_to_process)} available spots for slim documents"
)
for spot_name in spot_names_to_process:
try:
spot_id = self._get_spot_id_from_name(spot_name)
offset = 0
has_more = True
while has_more:
logger.info(
f"Retrieving slim documents from spot {spot_name}, offset {offset}"
)
response = self.client.get_spot_items(
spot_id=spot_id, offset=offset, page_size=self.batch_size
)
items = response.get("collection", [])
if not items:
has_more = False
continue
for item in items:
item_id = item.get("id")
if not item_id:
continue
slim_doc_batch.append(SlimDocument(id=f"HIGHSPOT_{item_id}"))
if len(slim_doc_batch) >= _SLIM_BATCH_SIZE:
yield slim_doc_batch
slim_doc_batch = []
has_more = len(items) >= self.batch_size
offset += self.batch_size
except (HighspotClientError, ValueError) as e:
logger.error(
f"Error retrieving slim documents from spot {spot_name}: {str(e)}"
try:
# If no spots specified, get all spots
spot_names_to_process = self.spot_names
if not spot_names_to_process:
spot_names_to_process = self._get_all_spot_names()
if not spot_names_to_process:
logger.warning("No spots found in Highspot")
raise ValueError("No spots found in Highspot")
logger.info(
f"No spots specified, using all {len(spot_names_to_process)} available spots for slim documents"
)
if slim_doc_batch:
yield slim_doc_batch
for spot_name in spot_names_to_process:
try:
spot_id = self._get_spot_id_from_name(spot_name)
offset = 0
has_more = True
while has_more:
logger.info(
f"Retrieving slim documents from spot {spot_name}, offset {offset}"
)
response = self.client.get_spot_items(
spot_id=spot_id, offset=offset, page_size=self.batch_size
)
items = response.get("collection", [])
if not items:
has_more = False
continue
for item in items:
item_id = item.get("id")
if not item_id:
continue
slim_doc_batch.append(
SlimDocument(id=f"HIGHSPOT_{item_id}")
)
if len(slim_doc_batch) >= _SLIM_BATCH_SIZE:
yield slim_doc_batch
slim_doc_batch = []
has_more = len(items) >= self.batch_size
offset += self.batch_size
except (HighspotClientError, ValueError) as e:
logger.error(
f"Error retrieving slim documents from spot {spot_name}: {str(e)}"
)
if slim_doc_batch:
yield slim_doc_batch
except Exception as e:
logger.error(f"Error in Highspot Slim Connector: {str(e)}")
raise
def validate_credentials(self) -> bool:
"""

View File

@@ -1,3 +1,4 @@
import sys
from datetime import datetime
from enum import Enum
from typing import Any
@@ -40,6 +41,9 @@ class TextSection(Section):
text: str
link: str | None = None
def __sizeof__(self) -> int:
return sys.getsizeof(self.text) + sys.getsizeof(self.link)
class ImageSection(Section):
"""Section containing an image reference"""
@@ -47,6 +51,9 @@ class ImageSection(Section):
image_file_name: str
link: str | None = None
def __sizeof__(self) -> int:
return sys.getsizeof(self.image_file_name) + sys.getsizeof(self.link)
class BasicExpertInfo(BaseModel):
"""Basic Information for the owner of a document, any of the fields can be left as None
@@ -110,6 +117,14 @@ class BasicExpertInfo(BaseModel):
)
)
def __sizeof__(self) -> int:
size = sys.getsizeof(self.display_name)
size += sys.getsizeof(self.first_name)
size += sys.getsizeof(self.middle_initial)
size += sys.getsizeof(self.last_name)
size += sys.getsizeof(self.email)
return size
class DocumentBase(BaseModel):
"""Used for Onyx ingestion api, the ID is inferred before use if not provided"""
@@ -163,6 +178,32 @@ class DocumentBase(BaseModel):
attributes.append(k + INDEX_SEPARATOR + v)
return attributes
def __sizeof__(self) -> int:
size = sys.getsizeof(self.id)
for section in self.sections:
size += sys.getsizeof(section)
size += sys.getsizeof(self.source)
size += sys.getsizeof(self.semantic_identifier)
size += sys.getsizeof(self.doc_updated_at)
size += sys.getsizeof(self.chunk_count)
if self.primary_owners is not None:
for primary_owner in self.primary_owners:
size += sys.getsizeof(primary_owner)
else:
size += sys.getsizeof(self.primary_owners)
if self.secondary_owners is not None:
for secondary_owner in self.secondary_owners:
size += sys.getsizeof(secondary_owner)
else:
size += sys.getsizeof(self.secondary_owners)
size += sys.getsizeof(self.title)
size += sys.getsizeof(self.from_ingestion_api)
size += sys.getsizeof(self.additional_info)
return size
def get_text_content(self) -> str:
return " ".join([section.text for section in self.sections if section.text])
@@ -194,6 +235,12 @@ class Document(DocumentBase):
from_ingestion_api=base.from_ingestion_api,
)
def __sizeof__(self) -> int:
size = super().__sizeof__()
size += sys.getsizeof(self.id)
size += sys.getsizeof(self.source)
return size
class IndexingDocument(Document):
"""Document with processed sections for indexing"""

View File

@@ -1,4 +1,9 @@
import gc
import os
import sys
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Any
from simple_salesforce import Salesforce
@@ -21,9 +26,13 @@ from onyx.connectors.salesforce.salesforce_calls import get_all_children_of_sf_t
from onyx.connectors.salesforce.sqlite_functions import get_affected_parent_ids_by_type
from onyx.connectors.salesforce.sqlite_functions import get_record
from onyx.connectors.salesforce.sqlite_functions import init_db
from onyx.connectors.salesforce.sqlite_functions import sqlite_log_stats
from onyx.connectors.salesforce.sqlite_functions import update_sf_db_with_csv
from onyx.connectors.salesforce.utils import BASE_DATA_PATH
from onyx.connectors.salesforce.utils import get_sqlite_db_path
from onyx.indexing.indexing_heartbeat import IndexingHeartbeatInterface
from onyx.utils.logger import setup_logger
from shared_configs.configs import MULTI_TENANT
logger = setup_logger()
@@ -32,6 +41,8 @@ _DEFAULT_PARENT_OBJECT_TYPES = ["Account"]
class SalesforceConnector(LoadConnector, PollConnector, SlimConnector):
MAX_BATCH_BYTES = 1024 * 1024
def __init__(
self,
batch_size: int = INDEX_BATCH_SIZE,
@@ -64,22 +75,45 @@ class SalesforceConnector(LoadConnector, PollConnector, SlimConnector):
raise ConnectorMissingCredentialError("Salesforce")
return self._sf_client
def _fetch_from_salesforce(
self,
@staticmethod
def reconstruct_object_types(directory: str) -> dict[str, list[str] | None]:
"""
Scans the given directory for all CSV files and reconstructs the available object types.
Assumes filenames are formatted as "ObjectType.filename.csv" or "ObjectType.csv".
Args:
directory (str): The path to the directory containing CSV files.
Returns:
dict[str, list[str]]: A dictionary mapping object types to lists of file paths.
"""
object_types = defaultdict(list)
for filename in os.listdir(directory):
if filename.endswith(".csv"):
parts = filename.split(".", 1) # Split on the first period
object_type = parts[0] # Take the first part as the object type
object_types[object_type].append(os.path.join(directory, filename))
return dict(object_types)
@staticmethod
def _download_object_csvs(
directory: str,
parent_object_list: list[str],
sf_client: Salesforce,
start: SecondsSinceUnixEpoch | None = None,
end: SecondsSinceUnixEpoch | None = None,
) -> GenerateDocumentsOutput:
init_db()
all_object_types: set[str] = set(self.parent_object_list)
) -> None:
all_object_types: set[str] = set(parent_object_list)
logger.info(f"Starting with {len(self.parent_object_list)} parent object types")
logger.debug(f"Parent object types: {self.parent_object_list}")
logger.info(
f"Parent object types: num={len(parent_object_list)} list={parent_object_list}"
)
# This takes like 20 seconds
for parent_object_type in self.parent_object_list:
child_types = get_all_children_of_sf_type(
self.sf_client, parent_object_type
)
for parent_object_type in parent_object_list:
child_types = get_all_children_of_sf_type(sf_client, parent_object_type)
all_object_types.update(child_types)
logger.debug(
f"Found {len(child_types)} child types for {parent_object_type}"
@@ -88,20 +122,53 @@ class SalesforceConnector(LoadConnector, PollConnector, SlimConnector):
# Always want to make sure user is grabbed for permissioning purposes
all_object_types.add("User")
logger.info(f"Found total of {len(all_object_types)} object types to fetch")
logger.debug(f"All object types: {all_object_types}")
logger.info(
f"All object types: num={len(all_object_types)} list={all_object_types}"
)
# gc.collect()
# checkpoint - we've found all object types, now time to fetch the data
logger.info("Starting to fetch CSVs for all object types")
logger.info("Fetching CSVs for all object types")
# This takes like 30 minutes first time and <2 minutes for updates
object_type_to_csv_path = fetch_all_csvs_in_parallel(
sf_client=self.sf_client,
sf_client=sf_client,
object_types=all_object_types,
start=start,
end=end,
target_dir=directory,
)
# print useful information
num_csvs = 0
num_bytes = 0
for object_type, csv_paths in object_type_to_csv_path.items():
if not csv_paths:
continue
for csv_path in csv_paths:
if not csv_path:
continue
file_path = Path(csv_path)
file_size = file_path.stat().st_size
num_csvs += 1
num_bytes += file_size
logger.info(
f"CSV info: object_type={object_type} path={csv_path} bytes={file_size}"
)
logger.info(f"CSV info total: total_csvs={num_csvs} total_bytes={num_bytes}")
@staticmethod
def _load_csvs_to_db(csv_directory: str, db_directory: str) -> set[str]:
updated_ids: set[str] = set()
object_type_to_csv_path = SalesforceConnector.reconstruct_object_types(
csv_directory
)
# This takes like 10 seconds
# This is for testing the rest of the functionality if data has
# already been fetched and put in sqlite
@@ -120,10 +187,16 @@ class SalesforceConnector(LoadConnector, PollConnector, SlimConnector):
# If path is None, it means it failed to fetch the csv
if csv_paths is None:
continue
# Go through each csv path and use it to update the db
for csv_path in csv_paths:
logger.debug(f"Updating {object_type} with {csv_path}")
logger.debug(
f"Processing CSV: object_type={object_type} "
f"csv={csv_path} "
f"len={Path(csv_path).stat().st_size}"
)
new_ids = update_sf_db_with_csv(
db_directory,
object_type=object_type,
csv_download_path=csv_path,
)
@@ -132,49 +205,127 @@ class SalesforceConnector(LoadConnector, PollConnector, SlimConnector):
f"Added {len(new_ids)} new/updated records for {object_type}"
)
os.remove(csv_path)
return updated_ids
def _fetch_from_salesforce(
self,
temp_dir: str,
start: SecondsSinceUnixEpoch | None = None,
end: SecondsSinceUnixEpoch | None = None,
) -> GenerateDocumentsOutput:
logger.info("_fetch_from_salesforce starting.")
if not self._sf_client:
raise RuntimeError("self._sf_client is None!")
init_db(temp_dir)
sqlite_log_stats(temp_dir)
# Step 1 - download
SalesforceConnector._download_object_csvs(
temp_dir, self.parent_object_list, self._sf_client, start, end
)
gc.collect()
# Step 2 - load CSV's to sqlite
updated_ids = SalesforceConnector._load_csvs_to_db(temp_dir, temp_dir)
gc.collect()
logger.info(f"Found {len(updated_ids)} total updated records")
logger.info(
f"Starting to process parent objects of types: {self.parent_object_list}"
)
docs_to_yield: list[Document] = []
# Step 3 - extract and index docs
batches_processed = 0
docs_processed = 0
docs_to_yield: list[Document] = []
docs_to_yield_bytes = 0
# Takes 15-20 seconds per batch
for parent_type, parent_id_batch in get_affected_parent_ids_by_type(
temp_dir,
updated_ids=list(updated_ids),
parent_types=self.parent_object_list,
):
batches_processed += 1
logger.info(
f"Processing batch of {len(parent_id_batch)} {parent_type} objects"
f"Processing batch: index={batches_processed} "
f"object_type={parent_type} "
f"len={len(parent_id_batch)} "
f"processed={docs_processed} "
f"remaining={len(updated_ids) - docs_processed}"
)
for parent_id in parent_id_batch:
if not (parent_object := get_record(parent_id, parent_type)):
if not (parent_object := get_record(temp_dir, parent_id, parent_type)):
logger.warning(
f"Failed to get parent object {parent_id} for {parent_type}"
)
continue
docs_to_yield.append(
convert_sf_object_to_doc(
sf_object=parent_object,
sf_instance=self.sf_client.sf_instance,
)
doc = convert_sf_object_to_doc(
temp_dir,
sf_object=parent_object,
sf_instance=self.sf_client.sf_instance,
)
doc_sizeof = sys.getsizeof(doc)
docs_to_yield_bytes += doc_sizeof
docs_to_yield.append(doc)
docs_processed += 1
if len(docs_to_yield) >= self.batch_size:
# memory usage is sensitive to the input length, so we're yielding immediately
# if the batch exceeds a certain byte length
if (
len(docs_to_yield) >= self.batch_size
or docs_to_yield_bytes > SalesforceConnector.MAX_BATCH_BYTES
):
yield docs_to_yield
docs_to_yield = []
docs_to_yield_bytes = 0
# observed a memory leak / size issue with the account table if we don't gc.collect here.
gc.collect()
yield docs_to_yield
logger.info(
f"Final processing stats: "
f"processed={docs_processed} "
f"remaining={len(updated_ids) - docs_processed}"
)
def load_from_state(self) -> GenerateDocumentsOutput:
return self._fetch_from_salesforce()
if MULTI_TENANT:
# if multi tenant, we cannot expect the sqlite db to be cached/present
with tempfile.TemporaryDirectory() as temp_dir:
return self._fetch_from_salesforce(temp_dir)
# nuke the db since we're starting from scratch
sqlite_db_path = get_sqlite_db_path(BASE_DATA_PATH)
if os.path.exists(sqlite_db_path):
logger.info(f"load_from_state: Removing db at {sqlite_db_path}.")
os.remove(sqlite_db_path)
return self._fetch_from_salesforce(BASE_DATA_PATH)
def poll_source(
self, start: SecondsSinceUnixEpoch, end: SecondsSinceUnixEpoch
) -> GenerateDocumentsOutput:
return self._fetch_from_salesforce(start=start, end=end)
if MULTI_TENANT:
# if multi tenant, we cannot expect the sqlite db to be cached/present
with tempfile.TemporaryDirectory() as temp_dir:
return self._fetch_from_salesforce(temp_dir, start=start, end=end)
if start == 0:
# nuke the db if we're starting from scratch
sqlite_db_path = get_sqlite_db_path(BASE_DATA_PATH)
if os.path.exists(sqlite_db_path):
logger.info(
f"poll_source: Starting at time 0, removing db at {sqlite_db_path}."
)
os.remove(sqlite_db_path)
return self._fetch_from_salesforce(BASE_DATA_PATH)
def retrieve_all_slim_documents(
self,
@@ -209,7 +360,7 @@ if __name__ == "__main__":
"sf_security_token": os.environ["SF_SECURITY_TOKEN"],
}
)
start_time = time.time()
start_time = time.monotonic()
doc_count = 0
section_count = 0
text_count = 0
@@ -221,7 +372,7 @@ if __name__ == "__main__":
for section in doc.sections:
if isinstance(section, TextSection) and section.text is not None:
text_count += len(section.text)
end_time = time.time()
end_time = time.monotonic()
print(f"Doc count: {doc_count}")
print(f"Section count: {section_count}")

View File

@@ -124,13 +124,14 @@ def _extract_section(salesforce_object: SalesforceObject, base_url: str) -> Text
def _extract_primary_owners(
directory: str,
sf_object: SalesforceObject,
) -> list[BasicExpertInfo] | None:
object_dict = sf_object.data
if not (last_modified_by_id := object_dict.get("LastModifiedById")):
logger.warning(f"No LastModifiedById found for {sf_object.id}")
return None
if not (last_modified_by := get_record(last_modified_by_id)):
if not (last_modified_by := get_record(directory, last_modified_by_id)):
logger.warning(f"No LastModifiedBy found for {last_modified_by_id}")
return None
@@ -159,6 +160,7 @@ def _extract_primary_owners(
def convert_sf_object_to_doc(
directory: str,
sf_object: SalesforceObject,
sf_instance: str,
) -> Document:
@@ -170,8 +172,8 @@ def convert_sf_object_to_doc(
extracted_semantic_identifier = object_dict.get("Name", "Unknown Object")
sections = [_extract_section(sf_object, base_url)]
for id in get_child_ids(sf_object.id):
if not (child_object := get_record(id)):
for id in get_child_ids(directory, sf_object.id):
if not (child_object := get_record(directory, id)):
continue
sections.append(_extract_section(child_object, base_url))
@@ -181,7 +183,7 @@ def convert_sf_object_to_doc(
source=DocumentSource.SALESFORCE,
semantic_identifier=extracted_semantic_identifier,
doc_updated_at=extracted_doc_updated_at,
primary_owners=_extract_primary_owners(sf_object),
primary_owners=_extract_primary_owners(directory, sf_object),
metadata={},
)
return doc

View File

@@ -11,13 +11,12 @@ from simple_salesforce.bulk2 import SFBulk2Type
from onyx.connectors.interfaces import SecondsSinceUnixEpoch
from onyx.connectors.salesforce.sqlite_functions import has_at_least_one_object_of_type
from onyx.connectors.salesforce.utils import get_object_type_path
from onyx.utils.logger import setup_logger
logger = setup_logger()
def _build_time_filter_for_salesforce(
def _build_last_modified_time_filter_for_salesforce(
start: SecondsSinceUnixEpoch | None, end: SecondsSinceUnixEpoch | None
) -> str:
if start is None or end is None:
@@ -30,6 +29,19 @@ def _build_time_filter_for_salesforce(
)
def _build_created_date_time_filter_for_salesforce(
start: SecondsSinceUnixEpoch | None, end: SecondsSinceUnixEpoch | None
) -> str:
if start is None or end is None:
return ""
start_datetime = datetime.fromtimestamp(start, UTC)
end_datetime = datetime.fromtimestamp(end, UTC)
return (
f" WHERE CreatedDate > {start_datetime.isoformat()} "
f"AND CreatedDate < {end_datetime.isoformat()}"
)
def _get_sf_type_object_json(sf_client: Salesforce, type_name: str) -> Any:
sf_object = SFType(type_name, sf_client.session_id, sf_client.sf_instance)
return sf_object.describe()
@@ -109,23 +121,6 @@ def _check_if_object_type_is_empty(
return True
def _check_for_existing_csvs(sf_type: str) -> list[str] | None:
# Check if the csv already exists
if os.path.exists(get_object_type_path(sf_type)):
existing_csvs = [
os.path.join(get_object_type_path(sf_type), f)
for f in os.listdir(get_object_type_path(sf_type))
if f.endswith(".csv")
]
# If the csv already exists, return the path
# This is likely due to a previous run that failed
# after downloading the csv but before the data was
# written to the db
if existing_csvs:
return existing_csvs
return None
def _build_bulk_query(sf_client: Salesforce, sf_type: str, time_filter: str) -> str:
queryable_fields = _get_all_queryable_fields_of_sf_type(sf_client, sf_type)
query = f"SELECT {', '.join(queryable_fields)} FROM {sf_type}{time_filter}"
@@ -133,16 +128,15 @@ def _build_bulk_query(sf_client: Salesforce, sf_type: str, time_filter: str) ->
def _bulk_retrieve_from_salesforce(
sf_client: Salesforce,
sf_type: str,
time_filter: str,
sf_client: Salesforce, sf_type: str, time_filter: str, target_dir: str
) -> tuple[str, list[str] | None]:
"""Returns a tuple of
1. the salesforce object type
2. the list of CSVs
"""
if not _check_if_object_type_is_empty(sf_client, sf_type, time_filter):
return sf_type, None
if existing_csvs := _check_for_existing_csvs(sf_type):
return sf_type, existing_csvs
query = _build_bulk_query(sf_client, sf_type, time_filter)
bulk_2_handler = SFBulk2Handler(
@@ -159,20 +153,33 @@ def _bulk_retrieve_from_salesforce(
)
logger.info(f"Downloading {sf_type}")
logger.info(f"Query: {query}")
logger.debug(f"Query: {query}")
try:
# This downloads the file to a file in the target path with a random name
results = bulk_2_type.download(
query=query,
path=get_object_type_path(sf_type),
path=target_dir,
max_records=1000000,
)
all_download_paths = [result["file"] for result in results]
# prepend each downloaded csv with the object type (delimiter = '.')
all_download_paths: list[str] = []
for result in results:
original_file_path = result["file"]
directory, filename = os.path.split(original_file_path)
new_filename = f"{sf_type}.{filename}"
new_file_path = os.path.join(directory, new_filename)
os.rename(original_file_path, new_file_path)
all_download_paths.append(new_file_path)
logger.info(f"Downloaded {sf_type} to {all_download_paths}")
return sf_type, all_download_paths
except Exception as e:
logger.info(f"Failed to download salesforce csv for object type {sf_type}: {e}")
logger.error(
f"Failed to download salesforce csv for object type {sf_type}: {e}"
)
logger.warning(f"Exceptioning query for object type {sf_type}: {query}")
return sf_type, None
@@ -181,12 +188,35 @@ def fetch_all_csvs_in_parallel(
object_types: set[str],
start: SecondsSinceUnixEpoch | None,
end: SecondsSinceUnixEpoch | None,
target_dir: str,
) -> dict[str, list[str] | None]:
"""
Fetches all the csvs in parallel for the given object types
Returns a dict of (sf_type, full_download_path)
"""
time_filter = _build_time_filter_for_salesforce(start, end)
# these types don't query properly and need looking at
# problem_types: set[str] = {
# "ContentDocumentLink",
# "RecordActionHistory",
# "PendingOrderSummary",
# "UnifiedActivityRelation",
# }
# these types don't have a LastModifiedDate field and instead use CreatedDate
created_date_types: set[str] = {
"AccountHistory",
"AccountTag",
"EntitySubscription",
}
last_modified_time_filter = _build_last_modified_time_filter_for_salesforce(
start, end
)
created_date_time_filter = _build_created_date_time_filter_for_salesforce(
start, end
)
time_filter_for_each_object_type = {}
# We do this outside of the thread pool executor because this requires
# a database connection and we don't want to block the thread pool
@@ -195,8 +225,11 @@ def fetch_all_csvs_in_parallel(
"""Only add time filter if there is at least one object of the type
in the database. We aren't worried about partially completed object update runs
because this occurs after we check for existing csvs which covers this case"""
if has_at_least_one_object_of_type(sf_type):
time_filter_for_each_object_type[sf_type] = time_filter
if has_at_least_one_object_of_type(target_dir, sf_type):
if sf_type in created_date_types:
time_filter_for_each_object_type[sf_type] = created_date_time_filter
else:
time_filter_for_each_object_type[sf_type] = last_modified_time_filter
else:
time_filter_for_each_object_type[sf_type] = ""
@@ -207,6 +240,7 @@ def fetch_all_csvs_in_parallel(
sf_client=sf_client,
sf_type=object_type,
time_filter=time_filter_for_each_object_type[object_type],
target_dir=target_dir,
),
object_types,
)

View File

@@ -2,8 +2,10 @@ import csv
import json
import os
import sqlite3
import time
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path
from onyx.connectors.salesforce.utils import get_sqlite_db_path
from onyx.connectors.salesforce.utils import SalesforceObject
@@ -16,6 +18,7 @@ logger = setup_logger()
@contextmanager
def get_db_connection(
directory: str,
isolation_level: str | None = None,
) -> Iterator[sqlite3.Connection]:
"""Get a database connection with proper isolation level and error handling.
@@ -25,7 +28,7 @@ def get_db_connection(
can be "IMMEDIATE" or "EXCLUSIVE" for more strict isolation.
"""
# 60 second timeout for locks
conn = sqlite3.connect(get_sqlite_db_path(), timeout=60.0)
conn = sqlite3.connect(get_sqlite_db_path(directory), timeout=60.0)
if isolation_level is not None:
conn.isolation_level = isolation_level
@@ -38,17 +41,41 @@ def get_db_connection(
conn.close()
def init_db() -> None:
def sqlite_log_stats(directory: str) -> None:
with get_db_connection(directory, "EXCLUSIVE") as conn:
cache_pages = conn.execute("PRAGMA cache_size").fetchone()[0]
page_size = conn.execute("PRAGMA page_size").fetchone()[0]
if cache_pages >= 0:
cache_bytes = cache_pages * page_size
else:
cache_bytes = abs(cache_pages * 1024)
logger.info(
f"SQLite stats: sqlite_version={sqlite3.sqlite_version} "
f"cache_pages={cache_pages} "
f"page_size={page_size} "
f"cache_bytes={cache_bytes}"
)
def init_db(directory: str) -> None:
"""Initialize the SQLite database with required tables if they don't exist."""
# Create database directory if it doesn't exist
os.makedirs(os.path.dirname(get_sqlite_db_path()), exist_ok=True)
start = time.monotonic()
with get_db_connection("EXCLUSIVE") as conn:
os.makedirs(os.path.dirname(get_sqlite_db_path(directory)), exist_ok=True)
with get_db_connection(directory, "EXCLUSIVE") as conn:
cursor = conn.cursor()
db_exists = os.path.exists(get_sqlite_db_path())
db_exists = os.path.exists(get_sqlite_db_path(directory))
if db_exists:
file_path = Path(get_sqlite_db_path(directory))
file_size = file_path.stat().st_size
logger.info(f"init_db - found existing sqlite db: len={file_size}")
else:
# why is this only if the db doesn't exist?
if not db_exists:
# Enable WAL mode for better concurrent access and write performance
cursor.execute("PRAGMA journal_mode=WAL")
cursor.execute("PRAGMA synchronous=NORMAL")
@@ -143,16 +170,31 @@ def init_db() -> None:
""",
)
elapsed = time.monotonic() - start
logger.info(f"init_db - create tables and indices: elapsed={elapsed:.2f}")
# Analyze tables to help query planner
cursor.execute("ANALYZE relationships")
cursor.execute("ANALYZE salesforce_objects")
cursor.execute("ANALYZE relationship_types")
cursor.execute("ANALYZE user_email_map")
# NOTE(rkuo): skip ANALYZE - it takes too long and we likely don't have
# complicated queries that need this
# start = time.monotonic()
# cursor.execute("ANALYZE relationships")
# cursor.execute("ANALYZE salesforce_objects")
# cursor.execute("ANALYZE relationship_types")
# cursor.execute("ANALYZE user_email_map")
# elapsed = time.monotonic() - start
# logger.info(f"init_db - analyze: elapsed={elapsed:.2f}")
# If database already existed but user_email_map needs to be populated
start = time.monotonic()
cursor.execute("SELECT COUNT(*) FROM user_email_map")
elapsed = time.monotonic() - start
logger.info(f"init_db - count user_email_map: elapsed={elapsed:.2f}")
start = time.monotonic()
if cursor.fetchone()[0] == 0:
_update_user_email_map(conn)
elapsed = time.monotonic() - start
logger.info(f"init_db - update_user_email_map: elapsed={elapsed:.2f}")
conn.commit()
@@ -240,15 +282,15 @@ def _update_user_email_map(conn: sqlite3.Connection) -> None:
def update_sf_db_with_csv(
directory: str,
object_type: str,
csv_download_path: str,
delete_csv_after_use: bool = True,
) -> list[str]:
"""Update the SF DB with a CSV file using SQLite storage."""
updated_ids = []
# Use IMMEDIATE to get a write lock at the start of the transaction
with get_db_connection("IMMEDIATE") as conn:
with get_db_connection(directory, "IMMEDIATE") as conn:
cursor = conn.cursor()
with open(csv_download_path, "r", newline="", encoding="utf-8") as f:
@@ -295,17 +337,12 @@ def update_sf_db_with_csv(
conn.commit()
if delete_csv_after_use:
# Remove the csv file after it has been used
# to successfully update the db
os.remove(csv_download_path)
return updated_ids
def get_child_ids(parent_id: str) -> set[str]:
def get_child_ids(directory: str, parent_id: str) -> set[str]:
"""Get all child IDs for a given parent ID."""
with get_db_connection() as conn:
with get_db_connection(directory) as conn:
cursor = conn.cursor()
# Force index usage with INDEXED BY
@@ -317,9 +354,9 @@ def get_child_ids(parent_id: str) -> set[str]:
return child_ids
def get_type_from_id(object_id: str) -> str | None:
def get_type_from_id(directory: str, object_id: str) -> str | None:
"""Get the type of an object from its ID."""
with get_db_connection() as conn:
with get_db_connection(directory) as conn:
cursor = conn.cursor()
cursor.execute(
"SELECT object_type FROM salesforce_objects WHERE id = ?", (object_id,)
@@ -332,15 +369,15 @@ def get_type_from_id(object_id: str) -> str | None:
def get_record(
object_id: str, object_type: str | None = None
directory: str, object_id: str, object_type: str | None = None
) -> SalesforceObject | None:
"""Retrieve the record and return it as a SalesforceObject."""
if object_type is None:
object_type = get_type_from_id(object_id)
object_type = get_type_from_id(directory, object_id)
if not object_type:
return None
with get_db_connection() as conn:
with get_db_connection(directory) as conn:
cursor = conn.cursor()
cursor.execute("SELECT data FROM salesforce_objects WHERE id = ?", (object_id,))
result = cursor.fetchone()
@@ -352,9 +389,9 @@ def get_record(
return SalesforceObject(id=object_id, type=object_type, data=data)
def find_ids_by_type(object_type: str) -> list[str]:
def find_ids_by_type(directory: str, object_type: str) -> list[str]:
"""Find all object IDs for rows of the specified type."""
with get_db_connection() as conn:
with get_db_connection(directory) as conn:
cursor = conn.cursor()
cursor.execute(
"SELECT id FROM salesforce_objects WHERE object_type = ?", (object_type,)
@@ -363,6 +400,7 @@ def find_ids_by_type(object_type: str) -> list[str]:
def get_affected_parent_ids_by_type(
directory: str,
updated_ids: list[str],
parent_types: list[str],
batch_size: int = 500,
@@ -374,7 +412,7 @@ def get_affected_parent_ids_by_type(
updated_ids_batches = batch_list(updated_ids, batch_size)
updated_parent_ids: set[str] = set()
with get_db_connection() as conn:
with get_db_connection(directory) as conn:
cursor = conn.cursor()
for batch_ids in updated_ids_batches:
@@ -419,7 +457,7 @@ def get_affected_parent_ids_by_type(
yield parent_type, new_affected_ids
def has_at_least_one_object_of_type(object_type: str) -> bool:
def has_at_least_one_object_of_type(directory: str, object_type: str) -> bool:
"""Check if there is at least one object of the specified type in the database.
Args:
@@ -428,7 +466,7 @@ def has_at_least_one_object_of_type(object_type: str) -> bool:
Returns:
bool: True if at least one object exists, False otherwise
"""
with get_db_connection() as conn:
with get_db_connection(directory) as conn:
cursor = conn.cursor()
cursor.execute(
"SELECT COUNT(*) FROM salesforce_objects WHERE object_type = ?",
@@ -443,7 +481,7 @@ def has_at_least_one_object_of_type(object_type: str) -> bool:
NULL_ID_STRING = "N/A"
def get_user_id_by_email(email: str) -> str | None:
def get_user_id_by_email(directory: str, email: str) -> str | None:
"""Get the Salesforce User ID for a given email address.
Args:
@@ -454,7 +492,7 @@ def get_user_id_by_email(email: str) -> str | None:
- was_found: True if the email exists in the table, False if not found
- user_id: The Salesforce User ID if exists, None otherwise
"""
with get_db_connection() as conn:
with get_db_connection(directory) as conn:
cursor = conn.cursor()
cursor.execute("SELECT user_id FROM user_email_map WHERE email = ?", (email,))
result = cursor.fetchone()
@@ -463,10 +501,10 @@ def get_user_id_by_email(email: str) -> str | None:
return result[0]
def update_email_to_id_table(email: str, id: str | None) -> None:
def update_email_to_id_table(directory: str, email: str, id: str | None) -> None:
"""Update the email to ID map table with a new email and ID."""
id_to_use = id or NULL_ID_STRING
with get_db_connection() as conn:
with get_db_connection(directory) as conn:
cursor = conn.cursor()
cursor.execute(
"INSERT OR REPLACE INTO user_email_map (email, user_id) VALUES (?, ?)",

View File

@@ -25,16 +25,14 @@ class SalesforceObject:
)
# This defines the base path for all data files relative to this file
# AKA BE CAREFUL WHEN MOVING THIS FILE
BASE_DATA_PATH = os.path.join(os.path.dirname(__file__), "data")
def get_sqlite_db_path() -> str:
def get_sqlite_db_path(directory: str) -> str:
"""Get the path to the sqlite db file."""
return os.path.join(BASE_DATA_PATH, "salesforce_db.sqlite")
return os.path.join(directory, "salesforce_db.sqlite")
def get_object_type_path(object_type: str) -> str:

View File
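The two Salesforce hunks above thread a directory argument through every sqlite helper instead of relying on the module-level BASE_DATA_PATH, so the object cache can live in a temporary directory that is discarded after a run. A minimal caller sketch, assuming the helpers live in a sqlite_functions module under the Salesforce connector (that import path is an assumption) and that the connector has already populated the cache:

import tempfile

# Import path assumed; get_record(directory, object_id) is defined in the hunk above.
from onyx.connectors.salesforce.sqlite_functions import get_record

with tempfile.TemporaryDirectory() as temp_dir:
    # All reads go against <temp_dir>/salesforce_db.sqlite and vanish with the directory.
    # Assumes the connector has already written its cache for this run.
    record = get_record(temp_dir, "001bm00000eu6n5AAA")
    if record is not None:
        print(record.type, record.data)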

@@ -255,7 +255,9 @@ _DISALLOWED_MSG_SUBTYPES = {
def default_msg_filter(message: MessageType) -> bool:
# Don't keep messages from bots
if message.get("bot_id") or message.get("app_id"):
if message.get("bot_profile", {}).get("name") == "OnyxConnector":
bot_profile_name = message.get("bot_profile", {}).get("name")
print(f"bot_profile_name: {bot_profile_name}")
if bot_profile_name == "DanswerBot Testing":
return False
return True

View File
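A quick sketch of the whitelist behavior added above. The import path is an assumption, and the True-means-drop convention is inferred from the branch shown, where a bot message is dropped unless it comes from the whitelisted testing bot:

# Import path assumed; behavior inferred from the hunk above.
from onyx.connectors.slack.connector import default_msg_filter

keep_me = {"bot_id": "B123", "bot_profile": {"name": "DanswerBot Testing"}}
drop_me = {"bot_id": "B456", "bot_profile": {"name": "SomeOtherBot"}}

assert default_msg_filter(keep_me) is False  # kept: whitelisted testing bot
assert default_msg_filter(drop_me) is True   # dropped: any other bot/app message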

@@ -5,11 +5,13 @@ from typing import cast
from sqlalchemy.orm import Session
from onyx.chat.models import ContextualPruningConfig
from onyx.chat.models import PromptConfig
from onyx.chat.models import SectionRelevancePiece
from onyx.chat.prune_and_merge import _merge_sections
from onyx.chat.prune_and_merge import ChunkRange
from onyx.chat.prune_and_merge import merge_chunk_intervals
from onyx.chat.prune_and_merge import prune_and_merge_sections
from onyx.configs.chat_configs import DISABLE_LLM_DOC_RELEVANCE
from onyx.context.search.enums import LLMEvaluationType
from onyx.context.search.enums import QueryFlow
@@ -61,6 +63,7 @@ class SearchPipeline:
| None = None,
rerank_metrics_callback: Callable[[RerankMetricsContainer], None] | None = None,
prompt_config: PromptConfig | None = None,
contextual_pruning_config: ContextualPruningConfig | None = None,
):
# NOTE: The Search Request contains a lot of fields that are overrides, many of them can be None
# and typically are None. The preprocessing will fetch default values to replace these empty overrides.
@@ -77,6 +80,9 @@ class SearchPipeline:
self.search_settings = get_current_search_settings(db_session)
self.document_index = get_default_document_index(self.search_settings, None)
self.prompt_config: PromptConfig | None = prompt_config
self.contextual_pruning_config: ContextualPruningConfig | None = (
contextual_pruning_config
)
# Preprocessing steps generate this
self._search_query: SearchQuery | None = None
@@ -221,7 +227,7 @@ class SearchPipeline:
# If ee is enabled, censor the chunk sections based on user access
# Otherwise, return the retrieved chunks
censored_chunks = fetch_ee_implementation_or_noop(
censored_chunks: list[InferenceChunk] = fetch_ee_implementation_or_noop(
"onyx.external_permissions.post_query_censoring",
"_post_query_chunk_censoring",
retrieved_chunks,
@@ -420,7 +426,26 @@ class SearchPipeline:
if self._final_context_sections is not None:
return self._final_context_sections
self._final_context_sections = _merge_sections(sections=self.reranked_sections)
if (
self.contextual_pruning_config is not None
and self.prompt_config is not None
):
self._final_context_sections = prune_and_merge_sections(
sections=self.reranked_sections,
section_relevance_list=None,
prompt_config=self.prompt_config,
llm_config=self.llm.config,
question=self.search_query.query,
contextual_pruning_config=self.contextual_pruning_config,
)
else:
logger.error(
"Contextual pruning or prompt config not set, using default merge"
)
self._final_context_sections = _merge_sections(
sections=self.reranked_sections
)
return self._final_context_sections
@property

View File

@@ -613,8 +613,19 @@ def fetch_connector_credential_pairs(
def resync_cc_pair(
cc_pair: ConnectorCredentialPair,
search_settings_id: int,
db_session: Session,
) -> None:
"""
Updates state stored in the connector_credential_pair table based on the
latest index attempt for the given search settings.
Args:
cc_pair: ConnectorCredentialPair to resync
search_settings_id: SearchSettings to use for resync
db_session: Database session
"""
def find_latest_index_attempt(
connector_id: int,
credential_id: int,
@@ -627,11 +638,10 @@ def resync_cc_pair(
ConnectorCredentialPair,
IndexAttempt.connector_credential_pair_id == ConnectorCredentialPair.id,
)
.join(SearchSettings, IndexAttempt.search_settings_id == SearchSettings.id)
.filter(
ConnectorCredentialPair.connector_id == connector_id,
ConnectorCredentialPair.credential_id == credential_id,
SearchSettings.status == IndexModelStatus.PRESENT,
IndexAttempt.search_settings_id == search_settings_id,
)
)

View File

@@ -43,6 +43,8 @@ from onyx.utils.logger import setup_logger
logger = setup_logger()
ONE_HOUR_IN_SECONDS = 60 * 60
def check_docs_exist(db_session: Session) -> bool:
stmt = select(exists(DbDocument))
@@ -607,6 +609,46 @@ def delete_documents_complete__no_commit(
delete_documents__no_commit(db_session, document_ids)
def delete_all_documents_for_connector_credential_pair(
db_session: Session,
connector_id: int,
credential_id: int,
timeout: int = ONE_HOUR_IN_SECONDS,
) -> None:
"""Delete all documents for a given connector credential pair.
This will delete all documents and their associated data (chunks, feedback, tags, etc.)
NOTE: a bit inefficient, but it's not a big deal since this is done rarely - only during
an index swap. If we wanted to make this more efficient, we could use a single delete
statement + cascade.
"""
batch_size = 1000
start_time = time.monotonic()
while True:
# Get document IDs in batches
stmt = (
select(DocumentByConnectorCredentialPair.id)
.where(
DocumentByConnectorCredentialPair.connector_id == connector_id,
DocumentByConnectorCredentialPair.credential_id == credential_id,
)
.limit(batch_size)
)
document_ids = db_session.scalars(stmt).all()
if not document_ids:
break
delete_documents_complete__no_commit(
db_session=db_session, document_ids=list(document_ids)
)
db_session.commit()
if time.monotonic() - start_time > timeout:
raise RuntimeError("Timeout reached while deleting documents")
def acquire_document_locks(db_session: Session, document_ids: list[str]) -> bool:
"""Acquire locks for the specified documents. Ideally this shouldn't be
called with large list of document_ids (an exception could be made if the

View File
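A hedged usage sketch of the batched delete above, as it might be wired into a cleanup path. The import matches the one used by the index-swap code further below; the engine URL is only a placeholder for whatever the application's db setup provides:

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

from onyx.db.document import delete_all_documents_for_connector_credential_pair

# Placeholder connection string; the real engine comes from the application's db setup.
engine = create_engine("postgresql+psycopg2://onyx:password@localhost:5432/onyx")

with Session(engine) as db_session:
    # Deletes documents 1000 at a time (committing per batch) and raises if it
    # runs past the one-hour default timeout.
    delete_all_documents_for_connector_credential_pair(
        db_session=db_session,
        connector_id=1,
        credential_id=2,
    )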

@@ -710,6 +710,25 @@ def cancel_indexing_attempts_past_model(
)
def cancel_indexing_attempts_for_search_settings(
search_settings_id: int,
db_session: Session,
) -> None:
"""Stops all indexing attempts that are in progress or not started for
the specified search settings."""
db_session.execute(
update(IndexAttempt)
.where(
IndexAttempt.status.in_(
[IndexingStatus.IN_PROGRESS, IndexingStatus.NOT_STARTED]
),
IndexAttempt.search_settings_id == search_settings_id,
)
.values(status=IndexingStatus.FAILED)
)
def count_unique_cc_pairs_with_successful_index_attempts(
search_settings_id: int | None,
db_session: Session,

View File

@@ -703,7 +703,11 @@ class Connector(Base):
)
documents_by_connector: Mapped[
list["DocumentByConnectorCredentialPair"]
] = relationship("DocumentByConnectorCredentialPair", back_populates="connector")
] = relationship(
"DocumentByConnectorCredentialPair",
back_populates="connector",
passive_deletes=True,
)
# synchronize this validation logic with RefreshFrequencySchema etc on front end
# until we have a centralized validation schema
@@ -757,7 +761,11 @@ class Credential(Base):
)
documents_by_credential: Mapped[
list["DocumentByConnectorCredentialPair"]
] = relationship("DocumentByConnectorCredentialPair", back_populates="credential")
] = relationship(
"DocumentByConnectorCredentialPair",
back_populates="credential",
passive_deletes=True,
)
user: Mapped[User | None] = relationship("User", back_populates="credentials")
@@ -1110,10 +1118,10 @@ class DocumentByConnectorCredentialPair(Base):
id: Mapped[str] = mapped_column(ForeignKey("document.id"), primary_key=True)
# TODO: transition this to use the ConnectorCredentialPair id directly
connector_id: Mapped[int] = mapped_column(
ForeignKey("connector.id"), primary_key=True
ForeignKey("connector.id", ondelete="CASCADE"), primary_key=True
)
credential_id: Mapped[int] = mapped_column(
ForeignKey("credential.id"), primary_key=True
ForeignKey("credential.id", ondelete="CASCADE"), primary_key=True
)
# used to better keep track of document counts at a connector level
@@ -1123,10 +1131,10 @@ class DocumentByConnectorCredentialPair(Base):
has_been_indexed: Mapped[bool] = mapped_column(Boolean)
connector: Mapped[Connector] = relationship(
"Connector", back_populates="documents_by_connector"
"Connector", back_populates="documents_by_connector", passive_deletes=True
)
credential: Mapped[Credential] = relationship(
"Credential", back_populates="documents_by_credential"
"Credential", back_populates="documents_by_credential", passive_deletes=True
)
__table_args__ = (
@@ -1650,8 +1658,8 @@ class Prompt(Base):
)
name: Mapped[str] = mapped_column(String)
description: Mapped[str] = mapped_column(String)
system_prompt: Mapped[str] = mapped_column(Text)
task_prompt: Mapped[str] = mapped_column(Text)
system_prompt: Mapped[str] = mapped_column(String(length=8000))
task_prompt: Mapped[str] = mapped_column(String(length=8000))
include_citations: Mapped[bool] = mapped_column(Boolean, default=True)
datetime_aware: Mapped[bool] = mapped_column(Boolean, default=True)
# Default prompts are configured via backend during deployment

View File
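The ondelete="CASCADE" foreign keys plus passive_deletes=True above let the database remove dependent DocumentByConnectorCredentialPair rows itself when a connector or credential row is deleted, instead of SQLAlchemy loading and deleting each child. A minimal sketch of the effect, with the session wiring assumed:

from sqlalchemy import delete
from sqlalchemy.orm import Session

from onyx.db.models import Connector  # model shown in the hunk above

def hard_delete_connector(db_session: Session, connector_id: int) -> None:
    # With passive_deletes, SQLAlchemy issues only this DELETE; the database
    # cascades removal of the matching DocumentByConnectorCredentialPair rows.
    db_session.execute(delete(Connector).where(Connector.id == connector_id))
    db_session.commit()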

@@ -37,8 +37,8 @@ from onyx.db.models import UserFile
from onyx.db.models import UserFolder
from onyx.db.models import UserGroup
from onyx.db.notification import create_notification
from onyx.server.features.persona.models import FullPersonaSnapshot
from onyx.server.features.persona.models import PersonaSharedNotificationData
from onyx.server.features.persona.models import PersonaSnapshot
from onyx.server.features.persona.models import PersonaUpsertRequest
from onyx.utils.logger import setup_logger
from onyx.utils.variable_functionality import fetch_versioned_implementation
@@ -201,7 +201,7 @@ def create_update_persona(
create_persona_request: PersonaUpsertRequest,
user: User | None,
db_session: Session,
) -> PersonaSnapshot:
) -> FullPersonaSnapshot:
"""Higher level function than upsert_persona, although either is valid to use."""
# Permission to actually use these is checked later
@@ -271,7 +271,7 @@ def create_update_persona(
logger.exception("Failed to create persona")
raise HTTPException(status_code=400, detail=str(e))
return PersonaSnapshot.from_model(persona)
return FullPersonaSnapshot.from_model(persona)
def update_persona_shared_users(

View File

@@ -3,8 +3,9 @@ from sqlalchemy.orm import Session
from onyx.configs.constants import KV_REINDEX_KEY
from onyx.db.connector_credential_pair import get_connector_credential_pairs
from onyx.db.connector_credential_pair import resync_cc_pair
from onyx.db.document import delete_all_documents_for_connector_credential_pair
from onyx.db.enums import IndexModelStatus
from onyx.db.index_attempt import cancel_indexing_attempts_past_model
from onyx.db.index_attempt import cancel_indexing_attempts_for_search_settings
from onyx.db.index_attempt import (
count_unique_cc_pairs_with_successful_index_attempts,
)
@@ -26,31 +27,49 @@ def _perform_index_swap(
current_search_settings: SearchSettings,
secondary_search_settings: SearchSettings,
all_cc_pairs: list[ConnectorCredentialPair],
cleanup_documents: bool = False,
) -> None:
"""Swap the indices and expire the old one."""
current_search_settings = get_current_search_settings(db_session)
update_search_settings_status(
search_settings=current_search_settings,
new_status=IndexModelStatus.PAST,
db_session=db_session,
)
update_search_settings_status(
search_settings=secondary_search_settings,
new_status=IndexModelStatus.PRESENT,
db_session=db_session,
)
if len(all_cc_pairs) > 0:
kv_store = get_kv_store()
kv_store.store(KV_REINDEX_KEY, False)
# Expire jobs for the now past index/embedding model
cancel_indexing_attempts_past_model(db_session)
cancel_indexing_attempts_for_search_settings(
search_settings_id=current_search_settings.id,
db_session=db_session,
)
# Recount aggregates
for cc_pair in all_cc_pairs:
resync_cc_pair(cc_pair, db_session=db_session)
resync_cc_pair(
cc_pair=cc_pair,
# sync based on the new search settings
search_settings_id=secondary_search_settings.id,
db_session=db_session,
)
if cleanup_documents:
# clean up all DocumentByConnectorCredentialPair / Document rows, since we're
# doing an instant swap and no documents will exist in the new index.
for cc_pair in all_cc_pairs:
delete_all_documents_for_connector_credential_pair(
db_session=db_session,
connector_id=cc_pair.connector_id,
credential_id=cc_pair.credential_id,
)
# swap over search settings
update_search_settings_status(
search_settings=current_search_settings,
new_status=IndexModelStatus.PAST,
db_session=db_session,
)
update_search_settings_status(
search_settings=secondary_search_settings,
new_status=IndexModelStatus.PRESENT,
db_session=db_session,
)
# remove the old index from the vector db
document_index = get_default_document_index(secondary_search_settings, None)
@@ -88,6 +107,9 @@ def check_and_perform_index_swap(db_session: Session) -> SearchSettings | None:
current_search_settings=current_search_settings,
secondary_search_settings=secondary_search_settings,
all_cc_pairs=all_cc_pairs,
# clean up all DocumentByConnectorCredentialPair / Document rows, since we're
# doing an instant swap.
cleanup_documents=True,
)
return current_search_settings

View File

@@ -2,6 +2,7 @@ import io
import json
import os
import re
import uuid
import zipfile
from collections.abc import Callable
from collections.abc import Iterator
@@ -14,6 +15,7 @@ from pathlib import Path
from typing import Any
from typing import IO
from typing import NamedTuple
from typing import Optional
import chardet
import docx # type: ignore
@@ -568,8 +570,8 @@ def extract_text_and_images(
def convert_docx_to_txt(
file: UploadFile, file_store: FileStore, file_path: str
) -> None:
file: UploadFile, file_store: FileStore, file_path: Optional[str] = None
) -> str:
"""
Helper to convert docx to a .txt file in the same filestore.
"""
@@ -581,15 +583,41 @@ def convert_docx_to_txt(
all_paras = [p.text for p in doc.paragraphs]
text_content = "\n".join(all_paras)
txt_file_path = docx_to_txt_filename(file_path)
file_name = file.filename or f"docx_{uuid.uuid4()}"
text_file_name = docx_to_txt_filename(file_path if file_path else file_name)
file_store.save_file(
file_name=txt_file_path,
file_name=text_file_name,
content=BytesIO(text_content.encode("utf-8")),
display_name=file.filename,
file_origin=FileOrigin.CONNECTOR,
file_type="text/plain",
)
return text_file_name
def docx_to_txt_filename(file_path: str) -> str:
return file_path.rsplit(".", 1)[0] + ".txt"
def convert_pdf_to_txt(file: UploadFile, file_store: FileStore, file_path: str) -> str:
"""
Helper to convert PDF to a .txt file in the same filestore.
"""
file.file.seek(0)
# Extract text from the PDF
text_content, _, _ = read_pdf_file(file.file)
text_file_name = pdf_to_txt_filename(file_path)
file_store.save_file(
file_name=text_file_name,
content=BytesIO(text_content.encode("utf-8")),
display_name=file.filename,
file_origin=FileOrigin.CONNECTOR,
file_type="text/plain",
)
return text_file_name
def pdf_to_txt_filename(file_path: str) -> str:
return file_path.rsplit(".", 1)[0] + ".txt"

View File
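A hedged sketch of how the new conversion helpers are meant to be called from an upload path, mirroring the router change further below. The get_default_file_store(db_session) signature is assumed; the conversion helpers themselves are defined in the hunk above:

import os
import uuid

from fastapi import UploadFile
from sqlalchemy.orm import Session

from onyx.file_processing.extract_file_text import convert_docx_to_txt, convert_pdf_to_txt
from onyx.file_store.file_store import get_default_file_store  # signature assumed

def store_upload_as_text(file: UploadFile, db_session: Session) -> str:
    file_store = get_default_file_store(db_session)
    file_path = os.path.join(str(uuid.uuid4()), file.filename or "upload")
    if file.content_type == "application/pdf":
        # Returns the name of the derived .txt file, e.g. "<uuid>/report.txt"
        return convert_pdf_to_txt(file, file_store, file_path)
    # For docx the path is optional; a name is generated from the filename if omitted.
    return convert_docx_to_txt(file, file_store, file_path)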

@@ -459,10 +459,6 @@ def process_image_sections(documents: list[Document]) -> list[IndexingDocument]:
llm = get_default_llm_with_vision()
if not llm:
logger.warning(
"No vision-capable LLM available. Image sections will not be processed."
)
# Even without LLM, we still convert to IndexingDocument with base Sections
return [
IndexingDocument(
@@ -929,10 +925,12 @@ def index_doc_batch(
for chunk_num, chunk in enumerate(chunks_with_embeddings)
]
logger.debug(
"Indexing the following chunks: "
f"{[chunk.to_short_descriptor() for chunk in access_aware_chunks]}"
)
short_descriptor_list = [
chunk.to_short_descriptor() for chunk in access_aware_chunks
]
short_descriptor_log = str(short_descriptor_list)[:1024]
logger.debug(f"Indexing the following chunks: {short_descriptor_log}")
# A document will not be spread across different batches, so all the
# documents with chunks in this set, are fully represented by the chunks
# in this set

View File

@@ -602,7 +602,7 @@ def get_max_input_tokens(
)
if input_toks <= 0:
raise RuntimeError("No tokens for input for the LLM given settings")
return GEN_AI_MODEL_FALLBACK_MAX_TOKENS
return input_toks

View File

@@ -1,3 +1,4 @@
import logging
import sys
import traceback
from collections.abc import AsyncGenerator
@@ -16,6 +17,7 @@ from fastapi.exceptions import RequestValidationError
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
from httpx_oauth.clients.google import GoogleOAuth2
from prometheus_fastapi_instrumentator import Instrumentator
from sentry_sdk.integrations.fastapi import FastApiIntegration
from sentry_sdk.integrations.starlette import StarletteIntegration
from sqlalchemy.orm import Session
@@ -102,6 +104,8 @@ from onyx.server.utils import BasicAuthenticationError
from onyx.setup import setup_multitenant_onyx
from onyx.setup import setup_onyx
from onyx.utils.logger import setup_logger
from onyx.utils.logger import setup_uvicorn_logger
from onyx.utils.middleware import add_onyx_request_id_middleware
from onyx.utils.telemetry import get_or_generate_uuid
from onyx.utils.telemetry import optional_telemetry
from onyx.utils.telemetry import RecordType
@@ -116,6 +120,12 @@ from shared_configs.contextvars import CURRENT_TENANT_ID_CONTEXTVAR
logger = setup_logger()
file_handlers = [
h for h in logger.logger.handlers if isinstance(h, logging.FileHandler)
]
setup_uvicorn_logger(shared_file_handlers=file_handlers)
def validation_exception_handler(request: Request, exc: Exception) -> JSONResponse:
if not isinstance(exc, RequestValidationError):
@@ -421,9 +431,14 @@ def get_application() -> FastAPI:
if LOG_ENDPOINT_LATENCY:
add_latency_logging_middleware(application, logger)
add_onyx_request_id_middleware(application, "API", logger)
# Ensure all routes have auth enabled or are explicitly marked as public
check_router_auth(application)
# Initialize and instrument the app
Instrumentator().instrument(application).expose(application)
return application

View File
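The Prometheus wiring above uses prometheus_fastapi_instrumentator. A standalone sketch of the same call pattern on a toy app (not Onyx's application factory):

from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

@app.get("/ping")
def ping() -> dict[str, str]:
    return {"status": "ok"}

# Collects default request metrics and serves them at GET /metrics,
# which is why /metrics is added to the public endpoint list below.
Instrumentator().instrument(app).expose(app)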

@@ -175,7 +175,7 @@ class EmbeddingModel:
embeddings: list[Embedding] = []
def process_batch(
batch_idx: int, text_batch: list[str]
batch_idx: int, batch_len: int, text_batch: list[str]
) -> tuple[int, list[Embedding]]:
if self.callback:
if self.callback.should_stop():
@@ -202,8 +202,8 @@ class EmbeddingModel:
end_time = time.time()
processing_time = end_time - start_time
logger.info(
f"Batch {batch_idx} processing time: {processing_time:.2f} seconds"
logger.debug(
f"EmbeddingModel.process_batch: Batch {batch_idx}/{batch_len} processing time: {processing_time:.2f} seconds"
)
return batch_idx, response.embeddings
@@ -215,7 +215,7 @@ class EmbeddingModel:
if num_threads >= 1 and self.provider_type and len(text_batches) > 1:
with ThreadPoolExecutor(max_workers=num_threads) as executor:
future_to_batch = {
executor.submit(process_batch, idx, batch): idx
executor.submit(process_batch, idx, len(text_batches), batch): idx
for idx, batch in enumerate(text_batches, start=1)
}
@@ -238,7 +238,7 @@ class EmbeddingModel:
else:
# Original sequential processing
for idx, text_batch in enumerate(text_batches, start=1):
_, batch_embeddings = process_batch(idx, text_batch)
_, batch_embeddings = process_batch(idx, len(text_batches), text_batch)
embeddings.extend(batch_embeddings)
if self.callback:
self.callback.progress("_batch_encode_texts", 1)

View File

@@ -0,0 +1,147 @@
# Standards
SEPARATOR_LINE = "-------"
SEPARATOR_LINE_LONG = "---------------"
NO_EXTRACTION = "No extraction of knowledge graph objects was feasible."
YES = "yes"
NO = "no"
DC_OBJECT_SEPARATOR = ";"
DC_OBJECT_NO_BASE_DATA_EXTRACTION_PROMPT = f"""
You are an expert in finding relevant objects/object specifications of the same type in a list of documents. \
In this case you are interested \
in generating: {{objects_of_interest}}.
You should look at the documents - in no particular order! - and extract each object you find in the documents.
{SEPARATOR_LINE}
Here are the documents you are supposed to search through:
--
{{document_text}}
{SEPARATOR_LINE}
Here are the task instructions you should use to help you find the desired objects:
{SEPARATOR_LINE}
{{task}}
{SEPARATOR_LINE}
Here is the question that may provide critical additional context for the task:
{SEPARATOR_LINE}
{{question}}
{SEPARATOR_LINE}
Please answer the question in the following format:
REASONING: <your reasoning for the classification> - OBJECTS: <the objects - just their names - that you found, \
separated by ';'>
""".strip()
DC_OBJECT_WITH_BASE_DATA_EXTRACTION_PROMPT = f"""
You are an expert in finding relevant objects/object specifications of the same type in a list of documents. \
In this case you are interested \
in generating: {{objects_of_interest}}.
You should look at the provided data - in no particular order! - and extract each object you find in the documents.
{SEPARATOR_LINE}
Here are the data provided by the user:
--
{{base_data}}
{SEPARATOR_LINE}
Here are the task instructions you should use to help you find the desired objects:
{SEPARATOR_LINE}
{{task}}
{SEPARATOR_LINE}
Here is the request that may provide critical additional context for the task:
{SEPARATOR_LINE}
{{question}}
{SEPARATOR_LINE}
Please address the request in the following format:
REASONING: <your reasoning for the classification> - OBJECTS: <the objects - just their names - that you found, \
separated by ';'>
""".strip()
DC_OBJECT_SOURCE_RESEARCH_PROMPT = f"""
Today is {{today}}. You are an expert in extracting relevant structured information from a list of documents that \
should relate to one object. (Try to make sure that you know it relates to that one object!).
You should look at the documents - in no particular order! - and extract the information asked for this task:
{SEPARATOR_LINE}
{{task}}
{SEPARATOR_LINE}
Here is the user question that may provide critical additional context for the task:
{SEPARATOR_LINE}
{{question}}
{SEPARATOR_LINE}
Here are the documents you are supposed to search through:
--
{{document_text}}
{SEPARATOR_LINE}
Note: please cite your sources inline as you generate the results! Use the format [1], etc. Infer the \
number from the provided context documents. This is very important!
Please address the task in the following format:
REASONING:
-- <your reasoning for the classification>
RESEARCH RESULTS:
{{format}}
""".strip()
DC_OBJECT_CONSOLIDATION_PROMPT = f"""
You are a helpful assistant that consolidates information about a specific object \
from multiple sources.
The object is:
{SEPARATOR_LINE}
{{object}}
{SEPARATOR_LINE}
and the information is
{SEPARATOR_LINE}
{{information}}
{SEPARATOR_LINE}
Here is the user question that may provide critical additional context for the task:
{SEPARATOR_LINE}
{{question}}
{SEPARATOR_LINE}
Please consolidate the information into a single, concise answer. The consolidated information \
for the object should be in the following format:
{SEPARATOR_LINE}
{{format}}
{SEPARATOR_LINE}
Overall, please use this structure to communicate the consolidated information:
{SEPARATOR_LINE}
REASONING: <your reasoning for consolidating the information>
INFORMATION:
<consolidated information in the proper format that you have created>
"""
DC_FORMATTING_NO_BASE_DATA_PROMPT = f"""
You are an expert in text formatting. Your task is to take a given text and convert it 100 percent accurately \
into a new format.
Here is the text you are supposed to format:
{SEPARATOR_LINE}
{{text}}
{SEPARATOR_LINE}
Here is the format you are supposed to use:
{SEPARATOR_LINE}
{{format}}
{SEPARATOR_LINE}
Please start the generation directly with the formatted text. (Note that the output should not be code, but text.)
"""
DC_FORMATTING_WITH_BASE_DATA_PROMPT = f"""
You are an expert in text formatting. Your task is to take a given text and the initial \
base data provided by the user, and convert it 100 percent accurately \
into a new format. The base data may also contain important relationships that are critical \
for the formatting.
Here is the initial data provided by the user:
{SEPARATOR_LINE}
{{base_data}}
{SEPARATOR_LINE}
Here is the text you are supposed to combine (and format) with the initial data, adhering to the \
format instructions provided later in the prompt:
{SEPARATOR_LINE}
{{text}}
{SEPARATOR_LINE}
And here are the format instructions you are supposed to use:
{SEPARATOR_LINE}
{{format}}
{SEPARATOR_LINE}
Please start the generation directly with the formatted text. (Note that the output should not be code, but text.)
"""

View File
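These templates are f-strings evaluated at import time, so only the doubled braces survive as {placeholders}. A hedged example of filling one in, with purely illustrative values:

prompt = DC_OBJECT_NO_BASE_DATA_EXTRACTION_PROMPT.format(
    objects_of_interest="competitor product names",
    document_text="<concatenated retrieved documents>",
    task="List every competitor product mentioned in the call notes.",
    question="Which competitors came up in recent sales calls?",
)
print(prompt.splitlines()[0])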

@@ -49,6 +49,7 @@ PUBLIC_ENDPOINT_SPECS = [
("/auth/oauth/callback", {"GET"}),
# anonymous user on cloud
("/tenants/anonymous-user", {"POST"}),
("/metrics", {"GET"}), # added by prometheus_fastapi_instrumentator
]

View File

@@ -21,7 +21,7 @@ from onyx.background.celery.tasks.external_group_syncing.tasks import (
from onyx.background.celery.tasks.pruning.tasks import (
try_creating_prune_generator_task,
)
from onyx.background.celery.versioned_apps.primary import app as primary_app
from onyx.background.celery.versioned_apps.client import app as client_app
from onyx.background.indexing.models import IndexAttemptErrorPydantic
from onyx.configs.constants import OnyxCeleryPriority
from onyx.configs.constants import OnyxCeleryTask
@@ -219,7 +219,7 @@ def update_cc_pair_status(
continue
# Revoke the task to prevent it from running
primary_app.control.revoke(index_payload.celery_task_id)
client_app.control.revoke(index_payload.celery_task_id)
# If it is running, then signaling for termination will get the
# watchdog thread to kill the spawned task
@@ -238,7 +238,7 @@ def update_cc_pair_status(
db_session.commit()
# this speeds up the start of indexing by firing the check immediately
primary_app.send_task(
client_app.send_task(
OnyxCeleryTask.CHECK_FOR_INDEXING,
kwargs=dict(tenant_id=tenant_id),
priority=OnyxCeleryPriority.HIGH,
@@ -376,7 +376,7 @@ def prune_cc_pair(
f"{cc_pair.connector.name} connector."
)
payload_id = try_creating_prune_generator_task(
primary_app, cc_pair, db_session, r, tenant_id
client_app, cc_pair, db_session, r, tenant_id
)
if not payload_id:
raise HTTPException(
@@ -450,7 +450,7 @@ def sync_cc_pair(
f"{cc_pair.connector.name} connector."
)
payload_id = try_creating_permissions_sync_task(
primary_app, cc_pair_id, r, tenant_id
client_app, cc_pair_id, r, tenant_id
)
if not payload_id:
raise HTTPException(
@@ -524,7 +524,7 @@ def sync_cc_pair_groups(
f"{cc_pair.connector.name} connector."
)
payload_id = try_creating_external_group_sync_task(
primary_app, cc_pair_id, r, tenant_id
client_app, cc_pair_id, r, tenant_id
)
if not payload_id:
raise HTTPException(
@@ -634,7 +634,7 @@ def associate_credential_to_connector(
)
# trigger indexing immediately
primary_app.send_task(
client_app.send_task(
OnyxCeleryTask.CHECK_FOR_INDEXING,
priority=OnyxCeleryPriority.HIGH,
kwargs={"tenant_id": tenant_id},

View File

@@ -20,7 +20,7 @@ from onyx.auth.users import current_admin_user
from onyx.auth.users import current_chat_accessible_user
from onyx.auth.users import current_curator_or_admin_user
from onyx.auth.users import current_user
from onyx.background.celery.versioned_apps.primary import app as primary_app
from onyx.background.celery.versioned_apps.client import app as client_app
from onyx.configs.app_configs import ENABLED_CONNECTOR_TYPES
from onyx.configs.app_configs import MOCK_CONNECTOR_FILE_PATH
from onyx.configs.constants import DocumentSource
@@ -100,6 +100,7 @@ from onyx.db.models import UserGroup__ConnectorCredentialPair
from onyx.db.search_settings import get_current_search_settings
from onyx.db.search_settings import get_secondary_search_settings
from onyx.file_processing.extract_file_text import convert_docx_to_txt
from onyx.file_processing.extract_file_text import convert_pdf_to_txt
from onyx.file_store.file_store import get_default_file_store
from onyx.key_value_store.interface import KvKeyNotFoundError
from onyx.redis.redis_connector import RedisConnector
@@ -128,6 +129,7 @@ from onyx.utils.telemetry import create_milestone_and_report
from onyx.utils.threadpool_concurrency import run_functions_tuples_in_parallel
from onyx.utils.variable_functionality import fetch_ee_implementation_or_noop
logger = setup_logger()
_GMAIL_CREDENTIAL_ID_COOKIE_NAME = "gmail_credential_id"
@@ -430,6 +432,23 @@ def upload_files(files: list[UploadFile], db_session: Session) -> FileUploadResp
)
continue
# Special handling for docx files - only store the plaintext version
if file.content_type and file.content_type.startswith(
"application/vnd.openxmlformats-officedocument.wordprocessingml.document"
):
file_path = os.path.join(str(uuid.uuid4()), cast(str, file.filename))
text_file_path = convert_docx_to_txt(file, file_store)
deduped_file_paths.append(text_file_path)
continue
# Special handling for PDF files - only store the plaintext version
if file.content_type and file.content_type.startswith("application/pdf"):
file_path = os.path.join(str(uuid.uuid4()), cast(str, file.filename))
text_file_path = convert_pdf_to_txt(file, file_store, file_path)
deduped_file_paths.append(text_file_path)
continue
# Default handling for all other file types
file_path = os.path.join(str(uuid.uuid4()), cast(str, file.filename))
deduped_file_paths.append(file_path)
file_store.save_file(
@@ -440,11 +459,6 @@ def upload_files(files: list[UploadFile], db_session: Session) -> FileUploadResp
file_type=file.content_type or "text/plain",
)
if file.content_type and file.content_type.startswith(
"application/vnd.openxmlformats-officedocument.wordprocessingml.document"
):
convert_docx_to_txt(file, file_store, file_path)
except ValueError as e:
raise HTTPException(status_code=400, detail=str(e))
return FileUploadResponse(file_paths=deduped_file_paths)
@@ -928,7 +942,7 @@ def create_connector_with_mock_credential(
)
# trigger indexing immediately
primary_app.send_task(
client_app.send_task(
OnyxCeleryTask.CHECK_FOR_INDEXING,
priority=OnyxCeleryPriority.HIGH,
kwargs={"tenant_id": tenant_id},
@@ -1314,7 +1328,7 @@ def trigger_indexing_for_cc_pair(
# run the beat task to pick up the triggers immediately
priority = OnyxCeleryPriority.HIGHEST if is_user_file else OnyxCeleryPriority.HIGH
logger.info(f"Sending indexing check task with priority {priority}")
primary_app.send_task(
client_app.send_task(
OnyxCeleryTask.CHECK_FOR_INDEXING,
priority=priority,
kwargs={"tenant_id": tenant_id},

View File

@@ -6,7 +6,7 @@ from sqlalchemy.orm import Session
from onyx.auth.users import current_curator_or_admin_user
from onyx.auth.users import current_user
from onyx.background.celery.versioned_apps.primary import app as primary_app
from onyx.background.celery.versioned_apps.client import app as client_app
from onyx.configs.constants import OnyxCeleryPriority
from onyx.configs.constants import OnyxCeleryTask
from onyx.db.document_set import check_document_sets_are_public
@@ -52,7 +52,7 @@ def create_document_set(
except Exception as e:
raise HTTPException(status_code=400, detail=str(e))
primary_app.send_task(
client_app.send_task(
OnyxCeleryTask.CHECK_FOR_VESPA_SYNC_TASK,
kwargs={"tenant_id": tenant_id},
priority=OnyxCeleryPriority.HIGH,
@@ -85,7 +85,7 @@ def patch_document_set(
except Exception as e:
raise HTTPException(status_code=400, detail=str(e))
primary_app.send_task(
client_app.send_task(
OnyxCeleryTask.CHECK_FOR_VESPA_SYNC_TASK,
kwargs={"tenant_id": tenant_id},
priority=OnyxCeleryPriority.HIGH,
@@ -108,7 +108,7 @@ def delete_document_set(
except Exception as e:
raise HTTPException(status_code=400, detail=str(e))
primary_app.send_task(
client_app.send_task(
OnyxCeleryTask.CHECK_FOR_VESPA_SYNC_TASK,
kwargs={"tenant_id": tenant_id},
priority=OnyxCeleryPriority.HIGH,

View File

@@ -43,6 +43,7 @@ from onyx.file_store.models import ChatFileType
from onyx.secondary_llm_flows.starter_message_creation import (
generate_starter_messages,
)
from onyx.server.features.persona.models import FullPersonaSnapshot
from onyx.server.features.persona.models import GenerateStarterMessageRequest
from onyx.server.features.persona.models import ImageGenerationToolStatus
from onyx.server.features.persona.models import PersonaLabelCreate
@@ -424,8 +425,8 @@ def get_persona(
persona_id: int,
user: User | None = Depends(current_limited_user),
db_session: Session = Depends(get_session),
) -> PersonaSnapshot:
return PersonaSnapshot.from_model(
) -> FullPersonaSnapshot:
return FullPersonaSnapshot.from_model(
get_persona_by_id(
persona_id=persona_id,
user=user,

View File

@@ -91,37 +91,80 @@ class PersonaUpsertRequest(BaseModel):
class PersonaSnapshot(BaseModel):
id: int
owner: MinimalUserSnapshot | None
name: str
is_visible: bool
is_public: bool
display_priority: int | None
description: str
num_chunks: float | None
llm_relevance_filter: bool
llm_filter_extraction: bool
llm_model_provider_override: str | None
llm_model_version_override: str | None
starter_messages: list[StarterMessage] | None
builtin_persona: bool
prompts: list[PromptSnapshot]
tools: list[ToolSnapshot]
document_sets: list[DocumentSet]
users: list[MinimalUserSnapshot]
groups: list[int]
icon_color: str | None
icon_shape: int | None
is_public: bool
is_visible: bool
icon_shape: int | None = None
icon_color: str | None = None
uploaded_image_id: str | None = None
is_default_persona: bool
user_file_ids: list[int] = Field(default_factory=list)
user_folder_ids: list[int] = Field(default_factory=list)
display_priority: int | None = None
is_default_persona: bool = False
builtin_persona: bool = False
starter_messages: list[StarterMessage] | None = None
tools: list[ToolSnapshot] = Field(default_factory=list)
labels: list["PersonaLabelSnapshot"] = Field(default_factory=list)
owner: MinimalUserSnapshot | None = None
users: list[MinimalUserSnapshot] = Field(default_factory=list)
groups: list[int] = Field(default_factory=list)
document_sets: list[DocumentSet] = Field(default_factory=list)
llm_model_provider_override: str | None = None
llm_model_version_override: str | None = None
num_chunks: float | None = None
@classmethod
def from_model(cls, persona: Persona) -> "PersonaSnapshot":
return PersonaSnapshot(
id=persona.id,
name=persona.name,
description=persona.description,
is_public=persona.is_public,
is_visible=persona.is_visible,
icon_shape=persona.icon_shape,
icon_color=persona.icon_color,
uploaded_image_id=persona.uploaded_image_id,
user_file_ids=[file.id for file in persona.user_files],
user_folder_ids=[folder.id for folder in persona.user_folders],
display_priority=persona.display_priority,
is_default_persona=persona.is_default_persona,
builtin_persona=persona.builtin_persona,
starter_messages=persona.starter_messages,
tools=[ToolSnapshot.from_model(tool) for tool in persona.tools],
labels=[PersonaLabelSnapshot.from_model(label) for label in persona.labels],
owner=(
MinimalUserSnapshot(id=persona.user.id, email=persona.user.email)
if persona.user
else None
),
users=[
MinimalUserSnapshot(id=user.id, email=user.email)
for user in persona.users
],
groups=[user_group.id for user_group in persona.groups],
document_sets=[
DocumentSet.from_model(document_set_model)
for document_set_model in persona.document_sets
],
llm_model_provider_override=persona.llm_model_provider_override,
llm_model_version_override=persona.llm_model_version_override,
num_chunks=persona.num_chunks,
)
# Model with full context on persona's internal settings
# This is used for flows which need to know all settings
class FullPersonaSnapshot(PersonaSnapshot):
search_start_date: datetime | None = None
labels: list["PersonaLabelSnapshot"] = []
user_file_ids: list[int] | None = None
user_folder_ids: list[int] | None = None
prompts: list[PromptSnapshot] = Field(default_factory=list)
llm_relevance_filter: bool = False
llm_filter_extraction: bool = False
@classmethod
def from_model(
cls, persona: Persona, allow_deleted: bool = False
) -> "PersonaSnapshot":
) -> "FullPersonaSnapshot":
if persona.deleted:
error_msg = f"Persona with ID {persona.id} has been deleted"
if not allow_deleted:
@@ -129,44 +172,32 @@ class PersonaSnapshot(BaseModel):
else:
logger.warning(error_msg)
return PersonaSnapshot(
return FullPersonaSnapshot(
id=persona.id,
name=persona.name,
description=persona.description,
is_public=persona.is_public,
is_visible=persona.is_visible,
icon_shape=persona.icon_shape,
icon_color=persona.icon_color,
uploaded_image_id=persona.uploaded_image_id,
user_file_ids=[file.id for file in persona.user_files],
user_folder_ids=[folder.id for folder in persona.user_folders],
display_priority=persona.display_priority,
is_default_persona=persona.is_default_persona,
builtin_persona=persona.builtin_persona,
starter_messages=persona.starter_messages,
tools=[ToolSnapshot.from_model(tool) for tool in persona.tools],
labels=[PersonaLabelSnapshot.from_model(label) for label in persona.labels],
owner=(
MinimalUserSnapshot(id=persona.user.id, email=persona.user.email)
if persona.user
else None
),
is_visible=persona.is_visible,
is_public=persona.is_public,
display_priority=persona.display_priority,
description=persona.description,
num_chunks=persona.num_chunks,
search_start_date=persona.search_start_date,
prompts=[PromptSnapshot.from_model(prompt) for prompt in persona.prompts],
llm_relevance_filter=persona.llm_relevance_filter,
llm_filter_extraction=persona.llm_filter_extraction,
llm_model_provider_override=persona.llm_model_provider_override,
llm_model_version_override=persona.llm_model_version_override,
starter_messages=persona.starter_messages,
builtin_persona=persona.builtin_persona,
is_default_persona=persona.is_default_persona,
prompts=[PromptSnapshot.from_model(prompt) for prompt in persona.prompts],
tools=[ToolSnapshot.from_model(tool) for tool in persona.tools],
document_sets=[
DocumentSet.from_model(document_set_model)
for document_set_model in persona.document_sets
],
users=[
MinimalUserSnapshot(id=user.id, email=user.email)
for user in persona.users
],
groups=[user_group.id for user_group in persona.groups],
icon_color=persona.icon_color,
icon_shape=persona.icon_shape,
uploaded_image_id=persona.uploaded_image_id,
search_start_date=persona.search_start_date,
labels=[PersonaLabelSnapshot.from_model(label) for label in persona.labels],
user_file_ids=[file.id for file in persona.user_files],
user_folder_ids=[folder.id for folder in persona.user_folders],
)

View File
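The split above keeps PersonaSnapshot as the lightweight listing payload and moves prompts, the LLM relevance/extraction flags, and search_start_date into the FullPersonaSnapshot subclass used by flows that need the complete configuration. A hedged sketch, assuming persona is an ORM Persona row already loaded in a session:

from onyx.server.features.persona.models import FullPersonaSnapshot, PersonaSnapshot

# `persona` is assumed to be an onyx.db.models.Persona instance loaded elsewhere.
light = PersonaSnapshot.from_model(persona)     # overview fields only
full = FullPersonaSnapshot.from_model(persona)  # adds prompts, filter flags, search_start_date

assert isinstance(full, PersonaSnapshot)  # subclass, usable wherever a snapshot is expected
print(len(full.prompts), full.search_start_date)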

@@ -10,7 +10,7 @@ from sqlalchemy.orm import Session
from onyx.auth.users import current_admin_user
from onyx.auth.users import current_curator_or_admin_user
from onyx.background.celery.versioned_apps.primary import app as primary_app
from onyx.background.celery.versioned_apps.client import app as client_app
from onyx.configs.app_configs import GENERATIVE_MODEL_ACCESS_CHECK_FREQ
from onyx.configs.constants import DocumentSource
from onyx.configs.constants import KV_GEN_AI_KEY_CHECK_TIME
@@ -192,7 +192,7 @@ def create_deletion_attempt_for_connector_id(
db_session.commit()
# run the beat task to pick up this deletion from the db immediately
primary_app.send_task(
client_app.send_task(
OnyxCeleryTask.CHECK_FOR_CONNECTOR_DELETION,
priority=OnyxCeleryPriority.HIGH,
kwargs={"tenant_id": tenant_id},

View File

@@ -19,6 +19,7 @@ from onyx.db.models import SlackBot as SlackAppModel
from onyx.db.models import SlackChannelConfig as SlackChannelConfigModel
from onyx.db.models import User
from onyx.onyxbot.slack.config import VALID_SLACK_FILTERS
from onyx.server.features.persona.models import FullPersonaSnapshot
from onyx.server.features.persona.models import PersonaSnapshot
from onyx.server.models import FullUserSnapshot
from onyx.server.models import InvitedUserSnapshot
@@ -245,7 +246,7 @@ class SlackChannelConfig(BaseModel):
id=slack_channel_config_model.id,
slack_bot_id=slack_channel_config_model.slack_bot_id,
persona=(
PersonaSnapshot.from_model(
FullPersonaSnapshot.from_model(
slack_channel_config_model.persona, allow_deleted=True
)
if slack_channel_config_model.persona

View File

@@ -117,7 +117,11 @@ def set_new_search_settings(
search_settings_id=search_settings.id, db_session=db_session
)
for cc_pair in get_connector_credential_pairs(db_session):
resync_cc_pair(cc_pair, db_session=db_session)
resync_cc_pair(
cc_pair=cc_pair,
search_settings_id=new_search_settings.id,
db_session=db_session,
)
db_session.commit()
return IdReturn(id=new_search_settings.id)

View File

@@ -96,7 +96,11 @@ def setup_onyx(
)
for cc_pair in get_connector_credential_pairs(db_session):
resync_cc_pair(cc_pair, db_session=db_session)
resync_cc_pair(
cc_pair=cc_pair,
search_settings_id=search_settings.id,
db_session=db_session,
)
# Expire all old embedding models indexing attempts, technically redundant
cancel_indexing_attempts_past_model(db_session)

View File

@@ -1,4 +1,5 @@
from collections.abc import Callable
from datetime import datetime
from typing import Any
from uuid import UUID
@@ -6,6 +7,7 @@ from pydantic import BaseModel
from pydantic import model_validator
from sqlalchemy.orm import Session
from onyx.configs.constants import DocumentSource
from onyx.context.search.enums import SearchType
from onyx.context.search.models import IndexFilters
from onyx.context.search.models import InferenceSection
@@ -75,6 +77,8 @@ class SearchToolOverrideKwargs(BaseModel):
ordering_only: bool | None = (
None # Flag for fast path when search is only needed for ordering
)
document_sources: list[DocumentSource] | None = None
time_cutoff: datetime | None = None
class Config:
arbitrary_types_allowed = True

View File

@@ -292,6 +292,8 @@ class SearchTool(Tool[SearchToolOverrideKwargs]):
user_file_ids = None
user_folder_ids = None
ordering_only = False
document_sources = None
time_cutoff = None
if override_kwargs:
force_no_rerank = use_alt_not_None(override_kwargs.force_no_rerank, False)
alternate_db_session = override_kwargs.alternate_db_session
@@ -302,6 +304,8 @@ class SearchTool(Tool[SearchToolOverrideKwargs]):
user_file_ids = override_kwargs.user_file_ids
user_folder_ids = override_kwargs.user_folder_ids
ordering_only = use_alt_not_None(override_kwargs.ordering_only, False)
document_sources = override_kwargs.document_sources
time_cutoff = override_kwargs.time_cutoff
# Fast path for ordering-only search
if ordering_only:
@@ -334,6 +338,23 @@ class SearchTool(Tool[SearchToolOverrideKwargs]):
)
retrieval_options = RetrievalDetails(filters=filters)
if document_sources or time_cutoff:
# Get retrieval_options and filters, or create if they don't exist
retrieval_options = retrieval_options or RetrievalDetails()
retrieval_options.filters = retrieval_options.filters or BaseFilters()
# Handle document sources
if document_sources:
source_types = retrieval_options.filters.source_type or []
retrieval_options.filters.source_type = list(
set(source_types + document_sources)
)
# Handle time cutoff
if time_cutoff:
# The override time_cutoff should supersede any existing time_cutoff, even if one is defined
retrieval_options.filters.time_cutoff = time_cutoff
search_pipeline = SearchPipeline(
search_request=SearchRequest(
query=query,
@@ -376,6 +397,7 @@ class SearchTool(Tool[SearchToolOverrideKwargs]):
db_session=alternate_db_session or self.db_session,
prompt_config=self.prompt_config,
retrieved_sections_callback=retrieved_sections_callback,
contextual_pruning_config=self.contextual_pruning_config,
)
search_query_info = SearchQueryInfo(
@@ -447,6 +469,7 @@ class SearchTool(Tool[SearchToolOverrideKwargs]):
db_session=self.db_session,
bypass_acl=self.bypass_acl,
prompt_config=self.prompt_config,
contextual_pruning_config=self.contextual_pruning_config,
)
# Log what we're doing

View File
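A hedged sketch of the new override fields being consumed above. DocumentSource comes from onyx.configs.constants as imported in the override-kwargs model; the import path for SearchToolOverrideKwargs itself is an assumption:

from datetime import datetime, timezone

from onyx.configs.constants import DocumentSource

# Import path assumed; the model and its new fields are shown above.
from onyx.tools.tool_implementations.search.search_tool import SearchToolOverrideKwargs

overrides = SearchToolOverrideKwargs(
    document_sources=[DocumentSource.SALESFORCE, DocumentSource.HIGHSPOT],
    time_cutoff=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
# SearchTool unions document_sources into filters.source_type and lets
# time_cutoff overwrite any pre-existing filters.time_cutoff.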

@@ -13,6 +13,7 @@ from shared_configs.configs import POSTGRES_DEFAULT_SCHEMA
from shared_configs.configs import SLACK_CHANNEL_ID
from shared_configs.configs import TENANT_ID_PREFIX
from shared_configs.contextvars import CURRENT_TENANT_ID_CONTEXTVAR
from shared_configs.contextvars import ONYX_REQUEST_ID_CONTEXTVAR
logging.addLevelName(logging.INFO + 5, "NOTICE")
@@ -71,6 +72,14 @@ def get_log_level_from_str(log_level_str: str = LOG_LEVEL) -> int:
return log_level_dict.get(log_level_str.upper(), logging.getLevelName("NOTICE"))
class OnyxRequestIDFilter(logging.Filter):
def filter(self, record: logging.LogRecord) -> bool:
from shared_configs.contextvars import ONYX_REQUEST_ID_CONTEXTVAR
record.request_id = ONYX_REQUEST_ID_CONTEXTVAR.get() or "-"
return True
class OnyxLoggingAdapter(logging.LoggerAdapter):
def process(
self, msg: str, kwargs: MutableMapping[str, Any]
@@ -103,6 +112,7 @@ class OnyxLoggingAdapter(logging.LoggerAdapter):
msg = f"[CC Pair: {cc_pair_id}] {msg}"
break
# Add tenant information if it differs from default
# This will always be the case for authenticated API requests
if MULTI_TENANT:
@@ -115,6 +125,11 @@ class OnyxLoggingAdapter(logging.LoggerAdapter):
)
msg = f"[t:{short_tenant}] {msg}"
# request id within a fastapi route
fastapi_request_id = ONYX_REQUEST_ID_CONTEXTVAR.get()
if fastapi_request_id:
msg = f"[{fastapi_request_id}] {msg}"
# For Slack Bot, logs the channel relevant to the request
channel_id = self.extra.get(SLACK_CHANNEL_ID) if self.extra else None
if channel_id:
@@ -165,6 +180,14 @@ class ColoredFormatter(logging.Formatter):
return super().format(record)
def get_uvicorn_standard_formatter() -> ColoredFormatter:
"""Returns a standard colored logging formatter."""
return ColoredFormatter(
"%(asctime)s %(filename)30s %(lineno)4s: [%(request_id)s] %(message)s",
datefmt="%m/%d/%Y %I:%M:%S %p",
)
def get_standard_formatter() -> ColoredFormatter:
"""Returns a standard colored logging formatter."""
return ColoredFormatter(
@@ -201,12 +224,6 @@ def setup_logger(
logger.addHandler(handler)
uvicorn_logger = logging.getLogger("uvicorn.access")
if uvicorn_logger:
uvicorn_logger.handlers = []
uvicorn_logger.addHandler(handler)
uvicorn_logger.setLevel(log_level)
is_containerized = is_running_in_container()
if LOG_FILE_NAME and (is_containerized or DEV_LOGGING_ENABLED):
log_levels = ["debug", "info", "notice"]
@@ -225,14 +242,37 @@ def setup_logger(
file_handler.setFormatter(formatter)
logger.addHandler(file_handler)
if uvicorn_logger:
uvicorn_logger.addHandler(file_handler)
logger.notice = lambda msg, *args, **kwargs: logger.log(logging.getLevelName("NOTICE"), msg, *args, **kwargs) # type: ignore
return OnyxLoggingAdapter(logger, extra=extra)
def setup_uvicorn_logger(
log_level: int = get_log_level_from_str(),
shared_file_handlers: list[logging.FileHandler] | None = None,
) -> None:
uvicorn_logger = logging.getLogger("uvicorn.access")
if not uvicorn_logger:
return
formatter = get_uvicorn_standard_formatter()
handler = logging.StreamHandler()
handler.setLevel(log_level)
handler.setFormatter(formatter)
uvicorn_logger.handlers = []
uvicorn_logger.addHandler(handler)
uvicorn_logger.setLevel(log_level)
uvicorn_logger.addFilter(OnyxRequestIDFilter())
if shared_file_handlers:
for fh in shared_file_handlers:
uvicorn_logger.addHandler(fh)
return
def print_loggers() -> None:
"""Print information about all loggers. Use to debug logging issues."""
root_logger = logging.getLogger()

View File

@@ -0,0 +1,62 @@
import base64
import hashlib
import logging
import uuid
from collections.abc import Awaitable
from collections.abc import Callable
from datetime import datetime
from datetime import timezone
from fastapi import FastAPI
from fastapi import Request
from fastapi import Response
from shared_configs.contextvars import ONYX_REQUEST_ID_CONTEXTVAR
def add_onyx_request_id_middleware(
app: FastAPI, prefix: str, logger: logging.LoggerAdapter
) -> None:
@app.middleware("http")
async def set_request_id(
request: Request, call_next: Callable[[Request], Awaitable[Response]]
) -> Response:
"""Generate a request hash that can be used to track the lifecycle
of a request. The hash is prefixed to help indicate where the request id
originated.
Format is f"{PREFIX}:{ID}" where PREFIX is 3 chars and ID is 8 chars.
Total length is 12 chars.
"""
onyx_request_id = request.headers.get("X-Onyx-Request-ID")
if not onyx_request_id:
onyx_request_id = make_randomized_onyx_request_id(prefix)
ONYX_REQUEST_ID_CONTEXTVAR.set(onyx_request_id)
return await call_next(request)
def make_randomized_onyx_request_id(prefix: str) -> str:
"""generates a randomized request id"""
hash_input = str(uuid.uuid4())
return _make_onyx_request_id(prefix, hash_input)
def make_structured_onyx_request_id(prefix: str, request_url: str) -> str:
"""Not used yet, but could be in the future!"""
hash_input = f"{request_url}:{datetime.now(timezone.utc)}"
return _make_onyx_request_id(prefix, hash_input)
def _make_onyx_request_id(prefix: str, hash_input: str) -> str:
"""helper function to return an id given a string input"""
hash_obj = hashlib.md5(hash_input.encode("utf-8"))
hash_bytes = hash_obj.digest()[:6] # Truncate to 6 bytes
# 6 bytes become 8 base64 characters; we shouldn't need to strip padding, but just in case
# NOTE: possible we'll want more input bytes if id's aren't unique enough
hash_str = base64.urlsafe_b64encode(hash_bytes).decode("utf-8").rstrip("=")
onyx_request_id = f"{prefix}:{hash_str}"
return onyx_request_id

View File
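A quick check of the helper above: with a 3-character prefix the ids come out as 12 characters, the prefix plus a colon plus 8 URL-safe base64 characters derived from an MD5 of a random UUID.

from onyx.utils.middleware import make_randomized_onyx_request_id

rid = make_randomized_onyx_request_id("API")
assert rid.startswith("API:") and len(rid) == 12
print(rid)  # e.g. "API:x7F2Qz9a" (the suffix varies per call)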

@@ -332,14 +332,15 @@ def wait_on_background(task: TimeoutThread[R]) -> R:
return task.result
def _next_or_none(ind: int, g: Iterator[R]) -> tuple[int, R | None]:
return ind, next(g, None)
def _next_or_none(ind: int, gen: Iterator[R]) -> tuple[int, R | None]:
return ind, next(gen, None)
def parallel_yield(gens: list[Iterator[R]], max_workers: int = 10) -> Iterator[R]:
with ThreadPoolExecutor(max_workers=max_workers) as executor:
future_to_index: dict[Future[tuple[int, R | None]], int] = {
executor.submit(_next_or_none, i, g): i for i, g in enumerate(gens)
executor.submit(_next_or_none, ind, gen): ind
for ind, gen in enumerate(gens)
}
next_ind = len(gens)

View File
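For context on the renamed helper, a hedged usage sketch of parallel_yield: it drains several generators concurrently, yielding items as each generator produces its next value. The import path is assumed to match the other threadpool utilities:

import time
from collections.abc import Iterator

# Import path assumed (same module as run_functions_tuples_in_parallel).
from onyx.utils.threadpool_concurrency import parallel_yield

def slow_count(start: int) -> Iterator[int]:
    for i in range(start, start + 3):
        time.sleep(0.01)
        yield i

for value in parallel_yield([slow_count(0), slow_count(100)], max_workers=2):
    print(value)  # items from both generators, interleaved as they become ready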

@@ -95,4 +95,5 @@ urllib3==2.2.3
mistune==0.8.4
sentry-sdk==2.14.0
prometheus_client==0.21.0
fastapi-limiter==0.1.6
fastapi-limiter==0.1.6
prometheus_fastapi_instrumentator==7.1.0

View File

@@ -15,4 +15,5 @@ uvicorn==0.21.1
voyageai==0.2.3
litellm==1.61.16
sentry-sdk[fastapi,celery,starlette]==2.14.0
aioboto3==13.4.0
aioboto3==13.4.0
prometheus_fastapi_instrumentator==7.1.0

View File

@@ -58,6 +58,7 @@ INDEXING_ONLY = os.environ.get("INDEXING_ONLY", "").lower() == "true"
# The process needs to have this for the log file to write to
# otherwise, it will not create additional log files
# This should just be the filename base without extension or path.
LOG_FILE_NAME = os.environ.get("LOG_FILE_NAME") or "onyx"
# Enable generating persistent log files for local dev environments

View File

@@ -11,6 +11,15 @@ CURRENT_TENANT_ID_CONTEXTVAR: contextvars.ContextVar[
"current_tenant_id", default=None if MULTI_TENANT else POSTGRES_DEFAULT_SCHEMA
)
# set by every route in the API server
INDEXING_REQUEST_ID_CONTEXTVAR: contextvars.ContextVar[
str | None
] = contextvars.ContextVar("indexing_request_id", default=None)
# set by every route in the API server
ONYX_REQUEST_ID_CONTEXTVAR: contextvars.ContextVar[str | None] = contextvars.ContextVar(
"onyx_request_id", default=None
)
"""Utils related to contextvars"""

View File

@@ -34,7 +34,7 @@ def confluence_connector(space: str) -> ConfluenceConnector:
return connector
@pytest.mark.parametrize("space", [os.environ["CONFLUENCE_TEST_SPACE"]])
@pytest.mark.parametrize("space", [os.getenv("CONFLUENCE_TEST_SPACE") or "DailyConne"])
@patch(
"onyx.file_processing.extract_file_text.get_unstructured_api_key",
return_value=None,

View File

@@ -0,0 +1,44 @@
import os
import time
from unittest.mock import MagicMock
from unittest.mock import patch
import pytest
from onyx.connectors.gong.connector import GongConnector
from onyx.connectors.models import Document
@pytest.fixture
def gong_connector() -> GongConnector:
connector = GongConnector()
connector.load_credentials(
{
"gong_access_key": os.environ["GONG_ACCESS_KEY"],
"gong_access_key_secret": os.environ["GONG_ACCESS_KEY_SECRET"],
}
)
return connector
@patch(
"onyx.file_processing.extract_file_text.get_unstructured_api_key",
return_value=None,
)
def test_gong_basic(mock_get_api_key: MagicMock, gong_connector: GongConnector) -> None:
doc_batch_generator = gong_connector.poll_source(0, time.time())
doc_batch = next(doc_batch_generator)
with pytest.raises(StopIteration):
next(doc_batch_generator)
assert len(doc_batch) == 2
docs: list[Document] = []
for doc in doc_batch:
docs.append(doc)
assert docs[0].semantic_identifier == "test with chris"
assert docs[1].semantic_identifier == "Testing Gong"

View File

@@ -1,6 +1,7 @@
import json
import os
import time
from datetime import datetime
from pathlib import Path
from unittest.mock import MagicMock
from unittest.mock import patch
@@ -105,6 +106,54 @@ def test_highspot_connector_slim(
assert len(all_slim_doc_ids) > 0
@patch(
"onyx.file_processing.extract_file_text.get_unstructured_api_key",
return_value=None,
)
def test_highspot_connector_poll_source(
mock_get_api_key: MagicMock, highspot_connector: HighspotConnector
) -> None:
"""Test poll_source functionality with date range filtering."""
# Define date range: April 3, 2025 to April 4, 2025
start_date = datetime(2025, 4, 3, 0, 0, 0)
end_date = datetime(2025, 4, 4, 23, 59, 59)
# Convert to seconds since Unix epoch
start_time = int(time.mktime(start_date.timetuple()))
end_time = int(time.mktime(end_date.timetuple()))
# Load test data for assertions
test_data = load_test_data()
poll_source_data = test_data.get("poll_source", {})
target_doc_id = poll_source_data.get("target_doc_id")
# Call poll_source with date range
all_docs: list[Document] = []
target_doc: Document | None = None
for doc_batch in highspot_connector.poll_source(start_time, end_time):
for doc in doc_batch:
all_docs.append(doc)
if doc.id == f"HIGHSPOT_{target_doc_id}":
target_doc = doc
# Verify documents were loaded
assert len(all_docs) > 0
# Verify the specific test document was found and has correct properties
assert target_doc is not None
assert target_doc.semantic_identifier == poll_source_data.get("semantic_identifier")
assert target_doc.source == DocumentSource.HIGHSPOT
assert target_doc.metadata is not None
# Verify sections
assert len(target_doc.sections) == 1
section = target_doc.sections[0]
assert section.link == poll_source_data.get("link")
assert section.text is not None
assert len(section.text) > 0
def test_highspot_connector_validate_credentials(
highspot_connector: HighspotConnector,
) -> None:

View File

@@ -1,5 +1,10 @@
{
"target_doc_id": "67cd8eb35d3ee0487de2e704",
"semantic_identifier": "Highspot in Action _ Salesforce Integration",
"link": "https://www.highspot.com/items/67cd8eb35d3ee0487de2e704"
"link": "https://www.highspot.com/items/67cd8eb35d3ee0487de2e704",
"poll_source": {
"target_doc_id":"67ef9edcc3f40b2bf3d816a8",
"semantic_identifier":"A Brief Introduction To AI",
"link":"https://www.highspot.com/items/67ef9edcc3f40b2bf3d816a8"
}
}

View File

@@ -35,23 +35,22 @@ def salesforce_connector() -> SalesforceConnector:
connector = SalesforceConnector(
requested_objects=["Account", "Contact", "Opportunity"],
)
username = os.environ["SF_USERNAME"]
password = os.environ["SF_PASSWORD"]
security_token = os.environ["SF_SECURITY_TOKEN"]
connector.load_credentials(
{
"sf_username": os.environ["SF_USERNAME"],
"sf_password": os.environ["SF_PASSWORD"],
"sf_security_token": os.environ["SF_SECURITY_TOKEN"],
"sf_username": username,
"sf_password": password,
"sf_security_token": security_token,
}
)
return connector
# TODO: make the credentials not expire
@pytest.mark.xfail(
reason=(
"Credentials change over time, so this test will fail if run when "
"the credentials expire."
)
)
def test_salesforce_connector_basic(salesforce_connector: SalesforceConnector) -> None:
test_data = load_test_data()
target_test_doc: Document | None = None
@@ -61,21 +60,26 @@ def test_salesforce_connector_basic(salesforce_connector: SalesforceConnector) -
all_docs.append(doc)
if doc.id == test_data["id"]:
target_test_doc = doc
break
# The number of docs here seems to change actively so do a very loose check
# as of 2025-03-28 it was around 32472
assert len(all_docs) > 32000
assert len(all_docs) < 40000
assert len(all_docs) == 6
assert target_test_doc is not None
# Set of received links
received_links: set[str] = set()
# List of received text fields, which contain key-value pairs separated by newlines
recieved_text: list[str] = []
received_text: list[str] = []
# Iterate over the sections of the target test doc to extract the links and text
for section in target_test_doc.sections:
assert section.link
assert section.text
received_links.add(section.link)
recieved_text.append(section.text)
received_text.append(section.text)
# Check that the received links match the expected links from the test data json
expected_links = set(test_data["expected_links"])
@@ -85,8 +89,9 @@ def test_salesforce_connector_basic(salesforce_connector: SalesforceConnector) -
expected_text = test_data["expected_text"]
if not isinstance(expected_text, list):
raise ValueError("Expected text is not a list")
unparsed_expected_key_value_pairs: list[str] = expected_text
received_key_value_pairs = extract_key_value_pairs_to_set(recieved_text)
received_key_value_pairs = extract_key_value_pairs_to_set(received_text)
expected_key_value_pairs = extract_key_value_pairs_to_set(
unparsed_expected_key_value_pairs
)
@@ -96,13 +101,21 @@ def test_salesforce_connector_basic(salesforce_connector: SalesforceConnector) -
assert target_test_doc.source == DocumentSource.SALESFORCE
assert target_test_doc.semantic_identifier == test_data["semantic_identifier"]
assert target_test_doc.metadata == test_data["metadata"]
assert target_test_doc.primary_owners == test_data["primary_owners"]
assert target_test_doc.primary_owners is not None
primary_owner = target_test_doc.primary_owners[0]
expected_primary_owner = test_data["primary_owners"]
assert isinstance(expected_primary_owner, dict)
assert primary_owner.email == expected_primary_owner["email"]
assert primary_owner.first_name == expected_primary_owner["first_name"]
assert primary_owner.last_name == expected_primary_owner["last_name"]
assert target_test_doc.secondary_owners == test_data["secondary_owners"]
assert target_test_doc.title == test_data["title"]
# TODO: make the credentials not expire
@pytest.mark.xfail(
@pytest.mark.skip(
reason=(
"Credentials change over time, so this test will fail if run when "
"the credentials expire."

View File

@@ -1,20 +1,162 @@
{
"id": "SALESFORCE_001fI000005drUcQAI",
"id": "SALESFORCE_001bm00000eu6n5AAA",
"expected_links": [
"https://customization-ruby-2195.my.salesforce.com/001fI000005drUcQAI",
"https://customization-ruby-2195.my.salesforce.com/003fI000001jiCPQAY",
"https://customization-ruby-2195.my.salesforce.com/017fI00000T7hvsQAB",
"https://customization-ruby-2195.my.salesforce.com/006fI000000rDvBQAU"
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESpEeAAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESqd3AAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESoKiAAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESvDSAA1",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrmHAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrl2AAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESvejAAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000EStlvAAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESpPfAAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrP9AAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESvlMAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESt3JAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESoBkAAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000EStw2AAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrkMAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESojKAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuLEAA1",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESoSIAA1",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESu2YAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESvgSAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESurnAAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrnqAAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESoB5AAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuJuAAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrfyAAD",
"https://danswer-dev-ed.develop.my.salesforce.com/001bm00000eu6n5AAA",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESpUHAA1",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESsgGAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESr7UAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESu1BAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESpqzAAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESplZAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESvJ3AAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESurKAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000EStSiAAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuJFAA1",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESu8xAAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESqfzAAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESqsrAAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000EStoZAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESsIUAA1",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESsAGAA1",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESv8GAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrOKAA1",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESoUmAAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESudKAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuJ8AAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESvf2AAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESw3qAAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESugRAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESr18AAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESqV1AAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuLVAA1",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESpjoAAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESqULAA1",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuCAAA1",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrfpAAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESp5YAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrMNAA1",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000EStaUAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESt5LAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrtcAAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESomaAAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrtIAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESoToAAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuWLAA1",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrWvAAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESsJEAA1",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESsxwAAD",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESvUgAAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESvWjAAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000EStBuAAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESpZiAAL",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuhYAAT",
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuWAAA1"
],
"expected_text": [
"BillingPostalCode: 60601\nType: Prospect\nWebsite: www.globalistindustries.com\nBillingCity: Chicago\nDescription: Globalist company\nIsDeleted: false\nIsPartner: false\nPhone: (312) 555-0456\nShippingCountry: USA\nShippingState: IL\nIsBuyer: false\nBillingCountry: USA\nBillingState: IL\nShippingPostalCode: 60601\nBillingStreet: 456 Market St\nIsCustomerPortal: false\nPersonActiveTrackerCount: 0\nShippingCity: Chicago\nShippingStreet: 456 Market St",
"FirstName: Michael\nMailingCountry: USA\nActiveTrackerCount: 0\nEmail: m.brown@globalindustries.com\nMailingState: IL\nMailingStreet: 456 Market St\nMailingCity: Chicago\nLastName: Brown\nTitle: CTO\nIsDeleted: false\nPhone: (312) 555-0456\nHasOptedOutOfEmail: false\nIsEmailBounced: false\nMailingPostalCode: 60601",
"ForecastCategory: Closed\nName: Global Industries Equipment Sale\nIsDeleted: false\nForecastCategoryName: Closed\nFiscalYear: 2024\nFiscalQuarter: 4\nIsClosed: true\nIsWon: true\nAmount: 5000000.0\nProbability: 100.0\nPushCount: 0\nHasOverdueTask: false\nStageName: Closed Won\nHasOpenActivity: false\nHasOpportunityLineItem: false",
"Field: created\nDataType: Text\nIsDeleted: false"
"IsDeleted: false\nBillingCity: Shaykh al \u00e1\u00b8\u00a8ad\u00c4\u00abd\nName: Voonder\nCleanStatus: Pending\nBillingStreet: 12 Cambridge Parkway",
"Email: eslayqzs@icio.us\nIsDeleted: false\nLastName: Slay\nIsEmailBounced: false\nFirstName: Ebeneser\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: ptweedgdh@umich.edu\nIsDeleted: false\nLastName: Tweed\nIsEmailBounced: false\nFirstName: Paulita\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: ehurnellnlx@facebook.com\nIsDeleted: false\nLastName: Hurnell\nIsEmailBounced: false\nFirstName: Eliot\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: ccarik4q4@google.it\nIsDeleted: false\nLastName: Carik\nIsEmailBounced: false\nFirstName: Chadwick\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: cvannozziina6@moonfruit.com\nIsDeleted: false\nLastName: Vannozzii\nIsEmailBounced: false\nFirstName: Christophorus\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: mikringill2kz@hugedomains.com\nIsDeleted: false\nLastName: Ikringill\nIsEmailBounced: false\nFirstName: Meghann\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: bgrinvalray@fda.gov\nIsDeleted: false\nLastName: Grinval\nIsEmailBounced: false\nFirstName: Berti\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: aollanderhr7@cam.ac.uk\nIsDeleted: false\nLastName: Ollander\nIsEmailBounced: false\nFirstName: Annemarie\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: rwhitesideq38@gravatar.com\nIsDeleted: false\nLastName: Whiteside\nIsEmailBounced: false\nFirstName: Rolando\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: vkrafthmz@techcrunch.com\nIsDeleted: false\nLastName: Kraft\nIsEmailBounced: false\nFirstName: Vidovik\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: jhillaut@4shared.com\nIsDeleted: false\nLastName: Hill\nIsEmailBounced: false\nFirstName: Janel\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: lralstonycs@discovery.com\nIsDeleted: false\nLastName: Ralston\nIsEmailBounced: false\nFirstName: Lorrayne\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: blyttlewba@networkadvertising.org\nIsDeleted: false\nLastName: Lyttle\nIsEmailBounced: false\nFirstName: Ban\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: pplummernvf@technorati.com\nIsDeleted: false\nLastName: Plummer\nIsEmailBounced: false\nFirstName: Pete\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: babrahamoffxpb@theatlantic.com\nIsDeleted: false\nLastName: Abrahamoff\nIsEmailBounced: false\nFirstName: Brander\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: ahargieym0@homestead.com\nIsDeleted: false\nLastName: Hargie\nIsEmailBounced: false\nFirstName: Aili\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: hstotthp2@yelp.com\nIsDeleted: false\nLastName: Stott\nIsEmailBounced: false\nFirstName: Hartley\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: jganniclifftuvj@blinklist.com\nIsDeleted: false\nLastName: Ganniclifft\nIsEmailBounced: false\nFirstName: Jamima\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: ldodelly8q@ed.gov\nIsDeleted: false\nLastName: Dodell\nIsEmailBounced: false\nFirstName: Lynde\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: rmilner3cp@smh.com.au\nIsDeleted: false\nLastName: Milner\nIsEmailBounced: false\nFirstName: Ralph\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: gghiriardellic19@state.tx.us\nIsDeleted: false\nLastName: Ghiriardelli\nIsEmailBounced: false\nFirstName: Garv\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: rhubatschfpu@nature.com\nIsDeleted: false\nLastName: Hubatsch\nIsEmailBounced: false\nFirstName: Rose\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: mtrenholme1ws@quantcast.com\nIsDeleted: false\nLastName: Trenholme\nIsEmailBounced: false\nFirstName: Mariejeanne\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: jmussettpbd@over-blog.com\nIsDeleted: false\nLastName: Mussett\nIsEmailBounced: false\nFirstName: Juliann\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: bgoroni145@illinois.edu\nIsDeleted: false\nLastName: Goroni\nIsEmailBounced: false\nFirstName: Bernarr\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: afalls3ph@theguardian.com\nIsDeleted: false\nLastName: Falls\nIsEmailBounced: false\nFirstName: Angelia\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: lswettjoi@go.com\nIsDeleted: false\nLastName: Swett\nIsEmailBounced: false\nFirstName: Levon\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: emullinsz38@dailymotion.com\nIsDeleted: false\nLastName: Mullins\nIsEmailBounced: false\nFirstName: Elsa\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: ibernettehco@ebay.co.uk\nIsDeleted: false\nLastName: Bernette\nIsEmailBounced: false\nFirstName: Ingrid\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: trisleybtt@simplemachines.org\nIsDeleted: false\nLastName: Risley\nIsEmailBounced: false\nFirstName: Toma\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: rgypsonqx1@goodreads.com\nIsDeleted: false\nLastName: Gypson\nIsEmailBounced: false\nFirstName: Reed\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: cposvneri28@jiathis.com\nIsDeleted: false\nLastName: Posvner\nIsEmailBounced: false\nFirstName: Culley\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: awilmut2rz@geocities.jp\nIsDeleted: false\nLastName: Wilmut\nIsEmailBounced: false\nFirstName: Andy\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: aluckwellra5@exblog.jp\nIsDeleted: false\nLastName: Luckwell\nIsEmailBounced: false\nFirstName: Andreana\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: irollings26j@timesonline.co.uk\nIsDeleted: false\nLastName: Rollings\nIsEmailBounced: false\nFirstName: Ibrahim\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: gspireqpd@g.co\nIsDeleted: false\nLastName: Spire\nIsEmailBounced: false\nFirstName: Gaelan\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: sbezleyk2y@acquirethisname.com\nIsDeleted: false\nLastName: Bezley\nIsEmailBounced: false\nFirstName: Sindee\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: icollerrr@flickr.com\nIsDeleted: false\nLastName: Coller\nIsEmailBounced: false\nFirstName: Inesita\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: kfolliott1bo@nature.com\nIsDeleted: false\nLastName: Folliott\nIsEmailBounced: false\nFirstName: Kennan\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: kroofjfo@gnu.org\nIsDeleted: false\nLastName: Roof\nIsEmailBounced: false\nFirstName: Karlik\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: lcovotti8s4@rediff.com\nIsDeleted: false\nLastName: Covotti\nIsEmailBounced: false\nFirstName: Lucho\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: gpatriskson1rs@census.gov\nIsDeleted: false\nLastName: Patriskson\nIsEmailBounced: false\nFirstName: Gardener\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: spidgleyqvw@usgs.gov\nIsDeleted: false\nLastName: Pidgley\nIsEmailBounced: false\nFirstName: Simona\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: cbecarrak0i@over-blog.com\nIsDeleted: false\nLastName: Becarra\nIsEmailBounced: false\nFirstName: Cally\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: aparkman9td@bbc.co.uk\nIsDeleted: false\nLastName: Parkman\nIsEmailBounced: false\nFirstName: Agneta\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: bboddingtonhn@quantcast.com\nIsDeleted: false\nLastName: Boddington\nIsEmailBounced: false\nFirstName: Betta\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: dcasementx0p@cafepress.com\nIsDeleted: false\nLastName: Casement\nIsEmailBounced: false\nFirstName: Dannie\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: hzornbhe@latimes.com\nIsDeleted: false\nLastName: Zorn\nIsEmailBounced: false\nFirstName: Haleigh\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: cfifieldbjb@blogspot.com\nIsDeleted: false\nLastName: Fifield\nIsEmailBounced: false\nFirstName: Christalle\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: ddewerson4t3@skype.com\nIsDeleted: false\nLastName: Dewerson\nIsEmailBounced: false\nFirstName: Dyann\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: khullock52p@sohu.com\nIsDeleted: false\nLastName: Hullock\nIsEmailBounced: false\nFirstName: Kellina\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: tfremantle32n@bandcamp.com\nIsDeleted: false\nLastName: Fremantle\nIsEmailBounced: false\nFirstName: Turner\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: sbernardtylp@nps.gov\nIsDeleted: false\nLastName: Bernardt\nIsEmailBounced: false\nFirstName: Selina\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: smcgettigan8kk@slideshare.net\nIsDeleted: false\nLastName: McGettigan\nIsEmailBounced: false\nFirstName: Sada\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: wdelafontvgn@businesswire.com\nIsDeleted: false\nLastName: Delafont\nIsEmailBounced: false\nFirstName: West\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: lbelsher9ne@indiatimes.com\nIsDeleted: false\nLastName: Belsher\nIsEmailBounced: false\nFirstName: Lou\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: cgoody27y@blogtalkradio.com\nIsDeleted: false\nLastName: Goody\nIsEmailBounced: false\nFirstName: Colene\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: cstodejzz@ucoz.ru\nIsDeleted: false\nLastName: Stode\nIsEmailBounced: false\nFirstName: Curcio\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: abromidgejb@china.com.cn\nIsDeleted: false\nLastName: Bromidge\nIsEmailBounced: false\nFirstName: Ariela\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: ldelgardilloqvp@xrea.com\nIsDeleted: false\nLastName: Delgardillo\nIsEmailBounced: false\nFirstName: Lauralee\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: dcroal9t4@businessinsider.com\nIsDeleted: false\nLastName: Croal\nIsEmailBounced: false\nFirstName: Devlin\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: dclarageqzb@wordpress.com\nIsDeleted: false\nLastName: Clarage\nIsEmailBounced: false\nFirstName: Dre\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: dthirlwall3jf@taobao.com\nIsDeleted: false\nLastName: Thirlwall\nIsEmailBounced: false\nFirstName: Dareen\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: tkeddie2lj@wiley.com\nIsDeleted: false\nLastName: Keddie\nIsEmailBounced: false\nFirstName: Tandi\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: jrimingtoni3i@istockphoto.com\nIsDeleted: false\nLastName: Rimington\nIsEmailBounced: false\nFirstName: Judy\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: gtroynet@slashdot.org\nIsDeleted: false\nLastName: Troy\nIsEmailBounced: false\nFirstName: Gail\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: ebunneyh0n@meetup.com\nIsDeleted: false\nLastName: Bunney\nIsEmailBounced: false\nFirstName: Efren\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: yhaken8p3@slate.com\nIsDeleted: false\nLastName: Haken\nIsEmailBounced: false\nFirstName: Yard\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: nolliffeq6q@biblegateway.com\nIsDeleted: false\nLastName: Olliffe\nIsEmailBounced: false\nFirstName: Nani\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: bgalia9jz@odnoklassniki.ru\nIsDeleted: false\nLastName: Galia\nIsEmailBounced: false\nFirstName: Berrie\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: djedrzej3v1@google.com\nIsDeleted: false\nLastName: Jedrzej\nIsEmailBounced: false\nFirstName: Deanne\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: mcamiesh1t@fc2.com\nIsDeleted: false\nLastName: Camies\nIsEmailBounced: false\nFirstName: Mikaela\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: csunshineqni@state.tx.us\nIsDeleted: false\nLastName: Sunshine\nIsEmailBounced: false\nFirstName: Curtis\nIsPriorityRecord: false\nCleanStatus: Pending",
"Email: fiannellib46@marriott.com\nIsDeleted: false\nLastName: Iannelli\nIsEmailBounced: false\nFirstName: Felicio\nIsPriorityRecord: false\nCleanStatus: Pending"
],
"semantic_identifier": "Unknown Object",
"semantic_identifier": "Voonder",
"metadata": {},
"primary_owners": null,
"primary_owners": {"email": "hagen@danswer.ai", "first_name": "Hagen", "last_name": "oneill"},
"secondary_owners": null,
"title": null
}

View File

@@ -444,6 +444,7 @@ class CCPairManager:
)
if group_sync_result.status_code != 409:
group_sync_result.raise_for_status()
time.sleep(2)
@staticmethod
def get_doc_sync_task(

View File

@@ -165,17 +165,18 @@ class DocumentManager:
doc["fields"]["document_id"]: doc["fields"] for doc in retrieved_docs_dict
}
# NOTE(rkuo): too much log spam
# Left this here for debugging purposes.
import json
# import json
print("DEBUGGING DOCUMENTS")
print(retrieved_docs)
for doc in retrieved_docs.values():
printable_doc = doc.copy()
print(printable_doc.keys())
printable_doc.pop("embeddings")
printable_doc.pop("title_embedding")
print(json.dumps(printable_doc, indent=2))
# print("DEBUGGING DOCUMENTS")
# print(retrieved_docs)
# for doc in retrieved_docs.values():
# printable_doc = doc.copy()
# print(printable_doc.keys())
# printable_doc.pop("embeddings")
# printable_doc.pop("title_embedding")
# print(json.dumps(printable_doc, indent=2))
for document in cc_pair.documents:
retrieved_doc = retrieved_docs.get(document.id)

View File

@@ -1,3 +1,4 @@
import time
from datetime import datetime
from datetime import timedelta
from urllib.parse import urlencode
@@ -191,7 +192,7 @@ class IndexAttemptManager:
user_performing_action: DATestUser | None = None,
) -> None:
"""Wait for an IndexAttempt to complete"""
start = datetime.now()
start = time.monotonic()
while True:
index_attempt = IndexAttemptManager.get_index_attempt_by_id(
index_attempt_id=index_attempt_id,
@@ -203,7 +204,7 @@ class IndexAttemptManager:
print(f"IndexAttempt {index_attempt_id} completed")
return
elapsed = (datetime.now() - start).total_seconds()
elapsed = time.monotonic() - start
if elapsed > timeout:
raise TimeoutError(
f"IndexAttempt {index_attempt_id} did not complete within {timeout} seconds"

View File

@@ -4,7 +4,7 @@ from uuid import uuid4
import requests
from onyx.context.search.enums import RecencyBiasSetting
from onyx.server.features.persona.models import PersonaSnapshot
from onyx.server.features.persona.models import FullPersonaSnapshot
from onyx.server.features.persona.models import PersonaUpsertRequest
from tests.integration.common_utils.constants import API_SERVER_URL
from tests.integration.common_utils.constants import GENERAL_HEADERS
@@ -181,7 +181,7 @@ class PersonaManager:
@staticmethod
def get_all(
user_performing_action: DATestUser | None = None,
) -> list[PersonaSnapshot]:
) -> list[FullPersonaSnapshot]:
response = requests.get(
f"{API_SERVER_URL}/admin/persona",
headers=user_performing_action.headers
@@ -189,13 +189,13 @@ class PersonaManager:
else GENERAL_HEADERS,
)
response.raise_for_status()
return [PersonaSnapshot(**persona) for persona in response.json()]
return [FullPersonaSnapshot(**persona) for persona in response.json()]
@staticmethod
def get_one(
persona_id: int,
user_performing_action: DATestUser | None = None,
) -> list[PersonaSnapshot]:
) -> list[FullPersonaSnapshot]:
response = requests.get(
f"{API_SERVER_URL}/persona/{persona_id}",
headers=user_performing_action.headers
@@ -203,7 +203,7 @@ class PersonaManager:
else GENERAL_HEADERS,
)
response.raise_for_status()
return [PersonaSnapshot(**response.json())]
return [FullPersonaSnapshot(**response.json())]
@staticmethod
def verify(

View File

@@ -313,29 +313,3 @@ class UserManager:
)
response.raise_for_status()
return UserInfo(**response.json())
@staticmethod
def invite_users(
user_performing_action: DATestUser,
emails: list[str],
) -> int:
response = requests.put(
url=f"{API_SERVER_URL}/manage/admin/users",
json={"emails": emails},
headers=user_performing_action.headers,
)
response.raise_for_status()
return response.json()
@staticmethod
def remove_invited_user(
user_performing_action: DATestUser,
user_email: str,
) -> int:
response = requests.patch(
url=f"{API_SERVER_URL}/manage/admin/remove-invited-user",
json={"user_email": user_email},
headers=user_performing_action.headers,
)
response.raise_for_status()
return response.json()

View File

@@ -22,7 +22,6 @@ from onyx.document_index.document_index_utils import get_multipass_config
from onyx.document_index.vespa.index import DOCUMENT_ID_ENDPOINT
from onyx.document_index.vespa.index import VespaIndex
from onyx.indexing.models import IndexingSetting
from onyx.redis.redis_pool import get_redis_client
from onyx.setup import setup_postgres
from onyx.setup import setup_vespa
from onyx.utils.logger import setup_logger
@@ -238,12 +237,6 @@ def reset_vespa() -> None:
time.sleep(5)
def reset_redis() -> None:
"""Reset the Redis database."""
redis_client = get_redis_client()
redis_client.flushall()
def reset_postgres_multitenant() -> None:
"""Reset the Postgres database for all tenants in a multitenant setup."""
@@ -348,8 +341,6 @@ def reset_all() -> None:
reset_postgres()
logger.info("Resetting Vespa...")
reset_vespa()
logger.info("Resetting Redis...")
reset_redis()
def reset_all_multitenant() -> None:

View File

@@ -14,9 +14,8 @@ from tests.integration.connector_job_tests.slack.slack_api_utils import SlackMan
@pytest.fixture()
def slack_test_setup() -> Generator[tuple[dict[str, Any], dict[str, Any]], None, None]:
slack_client = SlackManager.get_slack_client(os.environ["SLACK_BOT_TOKEN"])
admin_user_id = SlackManager.build_slack_user_email_id_map(slack_client)[
"admin@onyx-test.com"
]
user_map = SlackManager.build_slack_user_email_id_map(slack_client)
admin_user_id = user_map["admin@onyx-test.com"]
(
public_channel,

View File

@@ -3,8 +3,6 @@ from datetime import datetime
from datetime import timezone
from typing import Any
import pytest
from onyx.connectors.models import InputType
from onyx.db.enums import AccessType
from onyx.server.documents.models import DocumentSource
@@ -25,7 +23,6 @@ from tests.integration.common_utils.vespa import vespa_fixture
from tests.integration.connector_job_tests.slack.slack_api_utils import SlackManager
@pytest.mark.xfail(reason="flaky - see DAN-789 for example", strict=False)
def test_slack_permission_sync(
reset: None,
vespa_client: vespa_fixture,
@@ -221,7 +218,6 @@ def test_slack_permission_sync(
assert private_message not in onyx_doc_message_strings
@pytest.mark.xfail(reason="flaky", strict=False)
def test_slack_group_permission_sync(
reset: None,
vespa_client: vespa_fixture,

View File

@@ -1,38 +0,0 @@
import pytest
from requests import HTTPError
from onyx.auth.schemas import UserRole
from tests.integration.common_utils.managers.user import UserManager
from tests.integration.common_utils.test_models import DATestUser
def test_inviting_users_flow(reset: None) -> None:
"""
Test that verifies the functionality around inviting users:
1. Creating an admin user
2. Admin inviting a new user
3. Invited user successfully signing in
4. Non-invited user attempting to sign in (should result in an error)
"""
# 1) Create an admin user (the first user created is automatically admin)
admin_user: DATestUser = UserManager.create(name="admin_user")
assert admin_user is not None
assert UserManager.is_role(admin_user, UserRole.ADMIN)
# 2) Admin invites a new user
invited_email = "invited_user@test.com"
invite_response = UserManager.invite_users(admin_user, [invited_email])
assert invite_response == 1
# 3) The invited user successfully registers/logs in
invited_user: DATestUser = UserManager.create(
name="invited_user", email=invited_email
)
assert invited_user is not None
assert invited_user.email == invited_email
assert UserManager.is_role(invited_user, UserRole.BASIC)
# 4) A non-invited user attempts to sign in/register (should fail)
with pytest.raises(HTTPError):
UserManager.create(name="uninvited_user", email="uninvited_user@test.com")

View File

@@ -5,8 +5,11 @@ from ee.onyx.external_permissions.salesforce.postprocessing import (
)
from onyx.configs.app_configs import BLURB_SIZE
from onyx.configs.constants import DocumentSource
from onyx.connectors.salesforce.utils import BASE_DATA_PATH
from onyx.context.search.models import InferenceChunk
SQLITE_DIR = BASE_DATA_PATH
def create_test_chunk(
doc_id: str,
@@ -39,6 +42,7 @@ def create_test_chunk(
def test_validate_salesforce_access_single_object() -> None:
"""Test filtering when chunk has a single Salesforce object reference"""
section = "This is a test document about a Salesforce object."
test_content = section
test_chunk = create_test_chunk(

View File

@@ -4,6 +4,7 @@ from onyx.chat.prune_and_merge import _merge_sections
from onyx.configs.constants import DocumentSource
from onyx.context.search.models import InferenceChunk
from onyx.context.search.models import InferenceSection
from onyx.context.search.utils import inference_section_from_chunks
# This large test accounts for all of the following:
@@ -111,7 +112,7 @@ Content 17
# Sections
[
# Document 1, top/middle/bot connected + disconnected section
InferenceSection(
inference_section_from_chunks(
center_chunk=DOC_1_TOP_CHUNK,
chunks=[
DOC_1_FILLER_1,
@@ -120,9 +121,8 @@ Content 17
DOC_1_MID_CHUNK,
DOC_1_FILLER_3,
],
combined_content="N/A", # Not used
),
InferenceSection(
inference_section_from_chunks(
center_chunk=DOC_1_MID_CHUNK,
chunks=[
DOC_1_FILLER_2,
@@ -131,9 +131,8 @@ Content 17
DOC_1_FILLER_3,
DOC_1_FILLER_4,
],
combined_content="N/A",
),
InferenceSection(
inference_section_from_chunks(
center_chunk=DOC_1_BOTTOM_CHUNK,
chunks=[
DOC_1_FILLER_3,
@@ -142,9 +141,8 @@ Content 17
DOC_1_FILLER_5,
DOC_1_FILLER_6,
],
combined_content="N/A",
),
InferenceSection(
inference_section_from_chunks(
center_chunk=DOC_1_DISCONNECTED,
chunks=[
DOC_1_FILLER_7,
@@ -153,9 +151,8 @@ Content 17
DOC_1_FILLER_9,
DOC_1_FILLER_10,
],
combined_content="N/A",
),
InferenceSection(
inference_section_from_chunks(
center_chunk=DOC_2_TOP_CHUNK,
chunks=[
DOC_2_FILLER_1,
@@ -164,9 +161,8 @@ Content 17
DOC_2_FILLER_3,
DOC_2_BOTTOM_CHUNK,
],
combined_content="N/A",
),
InferenceSection(
inference_section_from_chunks(
center_chunk=DOC_2_BOTTOM_CHUNK,
chunks=[
DOC_2_TOP_CHUNK,
@@ -175,7 +171,6 @@ Content 17
DOC_2_FILLER_4,
DOC_2_FILLER_5,
],
combined_content="N/A",
),
],
# Expected Content
@@ -204,15 +199,13 @@ def test_merge_sections(
(
# Sections
[
InferenceSection(
inference_section_from_chunks(
center_chunk=DOC_1_TOP_CHUNK,
chunks=[DOC_1_TOP_CHUNK],
combined_content="N/A", # Not used
),
InferenceSection(
inference_section_from_chunks(
center_chunk=DOC_1_MID_CHUNK,
chunks=[DOC_1_MID_CHUNK],
combined_content="N/A",
),
],
# Expected Content
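The edits above swap direct InferenceSection(...) construction for the inference_section_from_chunks helper, so the tests no longer pass a combined_content placeholder. A hedged sketch of what such a helper plausibly does (the real one lives in onyx.context.search.utils; this is only an illustration, assuming chunks expose a .content field):

def _sketch_inference_section_from_chunks(center_chunk, chunks):
    # Derive the combined content from the chunks themselves instead of
    # requiring callers to pass an "N/A" placeholder.
    if not chunks:
        return None
    return InferenceSection(
        center_chunk=center_chunk,
        chunks=chunks,
        combined_content="\n".join(chunk.content for chunk in chunks),
    )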

View File

@@ -113,15 +113,18 @@ _VALID_SALESFORCE_IDS = [
]
def _clear_sf_db() -> None:
def _clear_sf_db(directory: str) -> None:
"""
Clears the SF DB by deleting all files in the data directory.
"""
shutil.rmtree(BASE_DATA_PATH, ignore_errors=True)
shutil.rmtree(directory, ignore_errors=True)
def _create_csv_file(
object_type: str, records: list[dict], filename: str = "test_data.csv"
directory: str,
object_type: str,
records: list[dict],
filename: str = "test_data.csv",
) -> None:
"""
Creates a CSV file for the given object type and records.
@@ -149,10 +152,10 @@ def _create_csv_file(
writer.writerow(record)
# Update the database with the CSV
update_sf_db_with_csv(object_type, csv_path)
update_sf_db_with_csv(directory, object_type, csv_path)
def _create_csv_with_example_data() -> None:
def _create_csv_with_example_data(directory: str) -> None:
"""
Creates CSV files with example data, organized by object type.
"""
@@ -342,10 +345,10 @@ def _create_csv_with_example_data() -> None:
# Create CSV files for each object type
for object_type, records in example_data.items():
_create_csv_file(object_type, records)
_create_csv_file(directory, object_type, records)
def _test_query() -> None:
def _test_query(directory: str) -> None:
"""
Tests querying functionality by verifying:
1. All expected Account IDs are found
@@ -401,7 +404,7 @@ def _test_query() -> None:
}
# Get all Account IDs
account_ids = find_ids_by_type("Account")
account_ids = find_ids_by_type(directory, "Account")
# Verify we found all expected accounts
assert len(account_ids) == len(
@@ -413,7 +416,7 @@ def _test_query() -> None:
# Verify each account's data
for acc_id in account_ids:
combined = get_record(acc_id)
combined = get_record(directory, acc_id)
assert combined is not None, f"Could not find account {acc_id}"
expected = expected_accounts[acc_id]
@@ -428,7 +431,7 @@ def _test_query() -> None:
print("All query tests passed successfully!")
def _test_upsert() -> None:
def _test_upsert(directory: str) -> None:
"""
Tests upsert functionality by:
1. Updating an existing account
@@ -453,10 +456,10 @@ def _test_upsert() -> None:
},
]
_create_csv_file("Account", update_data, "update_data.csv")
_create_csv_file(directory, "Account", update_data, "update_data.csv")
# Verify the update worked
updated_record = get_record(_VALID_SALESFORCE_IDS[0])
updated_record = get_record(directory, _VALID_SALESFORCE_IDS[0])
assert updated_record is not None, "Updated record not found"
assert updated_record.data["Name"] == "Acme Inc. Updated", "Name not updated"
assert (
@@ -464,7 +467,7 @@ def _test_upsert() -> None:
), "Description not added"
# Verify the new record was created
new_record = get_record(_VALID_SALESFORCE_IDS[2])
new_record = get_record(directory, _VALID_SALESFORCE_IDS[2])
assert new_record is not None, "New record not found"
assert new_record.data["Name"] == "New Company Inc.", "New record name incorrect"
assert new_record.data["AnnualRevenue"] == "1000000", "New record revenue incorrect"
@@ -472,7 +475,7 @@ def _test_upsert() -> None:
print("All upsert tests passed successfully!")
def _test_relationships() -> None:
def _test_relationships(directory: str) -> None:
"""
Tests relationship shelf updates and queries by:
1. Creating test data with relationships
@@ -513,11 +516,11 @@ def _test_relationships() -> None:
# Create and update CSV files for each object type
for object_type, records in test_data.items():
_create_csv_file(object_type, records, "relationship_test.csv")
_create_csv_file(directory, object_type, records, "relationship_test.csv")
# Test relationship queries
# All these objects should be children of Acme Inc.
child_ids = get_child_ids(_VALID_SALESFORCE_IDS[0])
child_ids = get_child_ids(directory, _VALID_SALESFORCE_IDS[0])
assert len(child_ids) == 4, f"Expected 4 child objects, found {len(child_ids)}"
assert _VALID_SALESFORCE_IDS[13] in child_ids, "Case 1 not found in relationship"
assert _VALID_SALESFORCE_IDS[14] in child_ids, "Case 2 not found in relationship"
@@ -527,7 +530,7 @@ def _test_relationships() -> None:
), "Opportunity not found in relationship"
# Test querying relationships for a different account (should be empty)
other_account_children = get_child_ids(_VALID_SALESFORCE_IDS[1])
other_account_children = get_child_ids(directory, _VALID_SALESFORCE_IDS[1])
assert (
len(other_account_children) == 0
), "Expected no children for different account"
@@ -535,7 +538,7 @@ def _test_relationships() -> None:
print("All relationship tests passed successfully!")
def _test_account_with_children() -> None:
def _test_account_with_children(directory: str) -> None:
"""
Tests querying all accounts and retrieving their child objects.
This test verifies that:
@@ -544,16 +547,16 @@ def _test_account_with_children() -> None:
3. Child object data is complete and accurate
"""
# First get all account IDs
account_ids = find_ids_by_type("Account")
account_ids = find_ids_by_type(directory, "Account")
assert len(account_ids) > 0, "No accounts found"
# For each account, get its children and verify the data
for account_id in account_ids:
account = get_record(account_id)
account = get_record(directory, account_id)
assert account is not None, f"Could not find account {account_id}"
# Get all child objects
child_ids = get_child_ids(account_id)
child_ids = get_child_ids(directory, account_id)
# For Acme Inc., verify specific relationships
if account_id == _VALID_SALESFORCE_IDS[0]: # Acme Inc.
@@ -564,7 +567,7 @@ def _test_account_with_children() -> None:
# Get all child records
child_records = []
for child_id in child_ids:
child_record = get_record(child_id)
child_record = get_record(directory, child_id)
if child_record is not None:
child_records.append(child_record)
# Verify Cases
@@ -599,7 +602,7 @@ def _test_account_with_children() -> None:
print("All account with children tests passed successfully!")
def _test_relationship_updates() -> None:
def _test_relationship_updates(directory: str) -> None:
"""
Tests that relationships are properly updated when a child object's parent reference changes.
This test verifies:
@@ -616,10 +619,10 @@ def _test_relationship_updates() -> None:
"LastName": "Contact",
}
]
_create_csv_file("Contact", initial_contact, "initial_contact.csv")
_create_csv_file(directory, "Contact", initial_contact, "initial_contact.csv")
# Verify initial relationship
acme_children = get_child_ids(_VALID_SALESFORCE_IDS[0])
acme_children = get_child_ids(directory, _VALID_SALESFORCE_IDS[0])
assert (
_VALID_SALESFORCE_IDS[40] in acme_children
), "Initial relationship not created"
@@ -633,22 +636,22 @@ def _test_relationship_updates() -> None:
"LastName": "Contact",
}
]
_create_csv_file("Contact", updated_contact, "updated_contact.csv")
_create_csv_file(directory, "Contact", updated_contact, "updated_contact.csv")
# Verify old relationship is removed
acme_children = get_child_ids(_VALID_SALESFORCE_IDS[0])
acme_children = get_child_ids(directory, _VALID_SALESFORCE_IDS[0])
assert (
_VALID_SALESFORCE_IDS[40] not in acme_children
), "Old relationship not removed"
# Verify new relationship is created
globex_children = get_child_ids(_VALID_SALESFORCE_IDS[1])
globex_children = get_child_ids(directory, _VALID_SALESFORCE_IDS[1])
assert _VALID_SALESFORCE_IDS[40] in globex_children, "New relationship not created"
print("All relationship update tests passed successfully!")
def _test_get_affected_parent_ids() -> None:
def _test_get_affected_parent_ids(directory: str) -> None:
"""
Tests get_affected_parent_ids functionality by verifying:
1. IDs that are directly in the parent_types list are included
@@ -683,13 +686,13 @@ def _test_get_affected_parent_ids() -> None:
# Create and update CSV files for test data
for object_type, records in test_data.items():
_create_csv_file(object_type, records)
_create_csv_file(directory, object_type, records)
# Test Case 1: Account directly in updated_ids and parent_types
updated_ids = [_VALID_SALESFORCE_IDS[1]] # Parent Account 2
parent_types = ["Account"]
affected_ids_by_type = dict(
get_affected_parent_ids_by_type(updated_ids, parent_types)
get_affected_parent_ids_by_type(directory, updated_ids, parent_types)
)
assert "Account" in affected_ids_by_type, "Account type not in affected_ids_by_type"
assert (
@@ -700,7 +703,7 @@ def _test_get_affected_parent_ids() -> None:
updated_ids = [_VALID_SALESFORCE_IDS[40]] # Child Contact
parent_types = ["Account"]
affected_ids_by_type = dict(
get_affected_parent_ids_by_type(updated_ids, parent_types)
get_affected_parent_ids_by_type(directory, updated_ids, parent_types)
)
assert "Account" in affected_ids_by_type, "Account type not in affected_ids_by_type"
assert (
@@ -711,7 +714,7 @@ def _test_get_affected_parent_ids() -> None:
updated_ids = [_VALID_SALESFORCE_IDS[1], _VALID_SALESFORCE_IDS[40]] # Both cases
parent_types = ["Account"]
affected_ids_by_type = dict(
get_affected_parent_ids_by_type(updated_ids, parent_types)
get_affected_parent_ids_by_type(directory, updated_ids, parent_types)
)
assert "Account" in affected_ids_by_type, "Account type not in affected_ids_by_type"
affected_ids = affected_ids_by_type["Account"]
@@ -726,7 +729,7 @@ def _test_get_affected_parent_ids() -> None:
updated_ids = [_VALID_SALESFORCE_IDS[40]] # Child Contact
parent_types = ["Opportunity"] # Wrong type
affected_ids_by_type = dict(
get_affected_parent_ids_by_type(updated_ids, parent_types)
get_affected_parent_ids_by_type(directory, updated_ids, parent_types)
)
assert len(affected_ids_by_type) == 0, "Should return empty dict when no matches"
@@ -734,13 +737,15 @@ def _test_get_affected_parent_ids() -> None:
def test_salesforce_sqlite() -> None:
_clear_sf_db()
init_db()
_create_csv_with_example_data()
_test_query()
_test_upsert()
_test_relationships()
_test_account_with_children()
_test_relationship_updates()
_test_get_affected_parent_ids()
_clear_sf_db()
directory = BASE_DATA_PATH
_clear_sf_db(directory)
init_db(directory)
_create_csv_with_example_data(directory)
_test_query(directory)
_test_upsert(directory)
_test_relationships(directory)
_test_account_with_children(directory)
_test_relationship_updates(directory)
_test_get_affected_parent_ids(directory)
_clear_sf_db(directory)
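With the helpers now taking an explicit directory, the flow is no longer tied to the shared BASE_DATA_PATH. A minimal sketch (an assumption, not something in this diff) of running the same steps against an isolated temporary directory via pytest's tmp_path fixture:

from pathlib import Path

def test_salesforce_sqlite_isolated(tmp_path: Path) -> None:
    # tmp_path is pytest's per-test temporary directory fixture.
    directory = str(tmp_path)
    _clear_sf_db(directory)
    init_db(directory)
    _create_csv_with_example_data(directory)
    _test_query(directory)
    _test_upsert(directory)
    _clear_sf_db(directory)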

View File

@@ -0,0 +1,208 @@
import os
from unittest.mock import MagicMock
from unittest.mock import patch
import pytest
from onyx.configs.constants import DocumentSource
from onyx.context.search.models import InferenceChunk
from onyx.db.models import User
from onyx.utils.variable_functionality import fetch_ee_implementation_or_noop
_post_query_chunk_censoring = fetch_ee_implementation_or_noop(
"onyx.external_permissions.post_query_censoring", "_post_query_chunk_censoring"
)
@pytest.mark.skipif(
os.environ.get("ENABLE_PAID_ENTERPRISE_EDITION_FEATURES", "").lower() != "true",
reason="Permissions tests are enterprise only",
)
class TestPostQueryChunkCensoring:
@pytest.fixture(autouse=True)
def setUp(self) -> None:
self.mock_user = User(id=1, email="test@example.com")
self.mock_chunk_1 = InferenceChunk(
document_id="doc1",
chunk_id=1,
content="chunk1 content",
source_type=DocumentSource.SALESFORCE,
semantic_identifier="doc1_1",
title="doc1",
boost=1,
recency_bias=1.0,
score=0.9,
hidden=False,
metadata={},
match_highlights=[],
doc_summary="doc1 summary",
chunk_context="doc1 context",
updated_at=None,
image_file_name=None,
source_links={},
section_continuation=False,
blurb="chunk1",
)
self.mock_chunk_2 = InferenceChunk(
document_id="doc2",
chunk_id=2,
content="chunk2 content",
source_type=DocumentSource.SLACK,
semantic_identifier="doc2_2",
title="doc2",
boost=1,
recency_bias=1.0,
score=0.8,
hidden=False,
metadata={},
match_highlights=[],
doc_summary="doc2 summary",
chunk_context="doc2 context",
updated_at=None,
image_file_name=None,
source_links={},
section_continuation=False,
blurb="chunk2",
)
self.mock_chunk_3 = InferenceChunk(
document_id="doc3",
chunk_id=3,
content="chunk3 content",
source_type=DocumentSource.SALESFORCE,
semantic_identifier="doc3_3",
title="doc3",
boost=1,
recency_bias=1.0,
score=0.7,
hidden=False,
metadata={},
match_highlights=[],
doc_summary="doc3 summary",
chunk_context="doc3 context",
updated_at=None,
image_file_name=None,
source_links={},
section_continuation=False,
blurb="chunk3",
)
self.mock_chunk_4 = InferenceChunk(
document_id="doc4",
chunk_id=4,
content="chunk4 content",
source_type=DocumentSource.SALESFORCE,
semantic_identifier="doc4_4",
title="doc4",
boost=1,
recency_bias=1.0,
score=0.6,
hidden=False,
metadata={},
match_highlights=[],
doc_summary="doc4 summary",
chunk_context="doc4 context",
updated_at=None,
image_file_name=None,
source_links={},
section_continuation=False,
blurb="chunk4",
)
@patch(
"ee.onyx.external_permissions.post_query_censoring._get_all_censoring_enabled_sources"
)
def test_post_query_chunk_censoring_no_user(
self, mock_get_sources: MagicMock
) -> None:
mock_get_sources.return_value = {DocumentSource.SALESFORCE}
chunks = [self.mock_chunk_1, self.mock_chunk_2]
result = _post_query_chunk_censoring(chunks, None)
assert result == chunks
@patch(
"ee.onyx.external_permissions.post_query_censoring._get_all_censoring_enabled_sources"
)
@patch(
"ee.onyx.external_permissions.post_query_censoring.DOC_SOURCE_TO_CHUNK_CENSORING_FUNCTION"
)
def test_post_query_chunk_censoring_salesforce_censored(
self, mock_censor_func: MagicMock, mock_get_sources: MagicMock
) -> None:
mock_get_sources.return_value = {DocumentSource.SALESFORCE}
mock_censor_func_impl = MagicMock(
return_value=[self.mock_chunk_1]
) # Only return chunk 1
mock_censor_func.__getitem__.return_value = mock_censor_func_impl
chunks = [self.mock_chunk_1, self.mock_chunk_2, self.mock_chunk_3]
result = _post_query_chunk_censoring(chunks, self.mock_user)
assert len(result) == 2
assert self.mock_chunk_1 in result
assert self.mock_chunk_2 in result
assert self.mock_chunk_3 not in result
mock_censor_func_impl.assert_called_once()
@patch(
"ee.onyx.external_permissions.post_query_censoring._get_all_censoring_enabled_sources"
)
@patch(
"ee.onyx.external_permissions.post_query_censoring.DOC_SOURCE_TO_CHUNK_CENSORING_FUNCTION"
)
def test_post_query_chunk_censoring_salesforce_error(
self, mock_censor_func: MagicMock, mock_get_sources: MagicMock
) -> None:
mock_get_sources.return_value = {DocumentSource.SALESFORCE}
mock_censor_func_impl = MagicMock(side_effect=Exception("Censoring error"))
mock_censor_func.__getitem__.return_value = mock_censor_func_impl
chunks = [self.mock_chunk_1, self.mock_chunk_2, self.mock_chunk_3]
result = _post_query_chunk_censoring(chunks, self.mock_user)
assert len(result) == 1
assert self.mock_chunk_2 in result
mock_censor_func_impl.assert_called_once()
@patch(
"ee.onyx.external_permissions.post_query_censoring._get_all_censoring_enabled_sources"
)
@patch(
"ee.onyx.external_permissions.post_query_censoring.DOC_SOURCE_TO_CHUNK_CENSORING_FUNCTION"
)
def test_post_query_chunk_censoring_no_censoring(
self, mock_censor_func: MagicMock, mock_get_sources: MagicMock
) -> None:
mock_get_sources.return_value = set() # No sources to censor
mock_censor_func_impl = MagicMock()
mock_censor_func.__getitem__.return_value = mock_censor_func_impl
chunks = [self.mock_chunk_1, self.mock_chunk_2, self.mock_chunk_3]
result = _post_query_chunk_censoring(chunks, self.mock_user)
assert result == chunks
mock_censor_func_impl.assert_not_called()
@patch(
"ee.onyx.external_permissions.post_query_censoring._get_all_censoring_enabled_sources"
)
@patch(
"ee.onyx.external_permissions.post_query_censoring.DOC_SOURCE_TO_CHUNK_CENSORING_FUNCTION"
)
def test_post_query_chunk_censoring_order_maintained(
self, mock_censor_func: MagicMock, mock_get_sources: MagicMock
) -> None:
mock_get_sources.return_value = {DocumentSource.SALESFORCE}
mock_censor_func_impl = MagicMock(
return_value=[self.mock_chunk_3, self.mock_chunk_1]
) # Return chunk 3 and 1
mock_censor_func.__getitem__.return_value = mock_censor_func_impl
chunks = [
self.mock_chunk_1,
self.mock_chunk_2,
self.mock_chunk_3,
self.mock_chunk_4,
]
result = _post_query_chunk_censoring(chunks, self.mock_user)
assert len(result) == 3
assert result[0] == self.mock_chunk_1
assert result[1] == self.mock_chunk_2
assert result[2] == self.mock_chunk_3
assert self.mock_chunk_4 not in result
mock_censor_func_impl.assert_called_once()
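The tests above pin down the expected behavior of _post_query_chunk_censoring: no censoring without a user, a censoring error drops that source's chunks, and surviving chunks keep their original order. A hedged sketch of logic consistent with those assertions (illustrative only, not the ee implementation):

def _censor_chunks_sketch(chunks, censoring_fns, user):
    if user is None:
        return chunks
    kept = set()
    for source in {chunk.source_type for chunk in chunks}:
        group = [chunk for chunk in chunks if chunk.source_type == source]
        censor = censoring_fns.get(source)
        if censor is None:
            kept.update(id(chunk) for chunk in group)
            continue
        try:
            kept.update(id(chunk) for chunk in censor(group, user))
        except Exception:
            continue  # a censoring failure drops that source's chunks
    # Re-emit in the original order so the ranking is preserved.
    return [chunk for chunk in chunks if id(chunk) in kept]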

View File

@@ -0,0 +1,270 @@
from datetime import datetime
from datetime import timedelta
from datetime import timezone
from onyx.configs.constants import DocumentSource
from onyx.configs.constants import INDEX_SEPARATOR
from onyx.context.search.models import IndexFilters
from onyx.context.search.models import Tag
from onyx.document_index.vespa.shared_utils.vespa_request_builders import (
build_vespa_filters,
)
from onyx.document_index.vespa_constants import DOC_UPDATED_AT
from onyx.document_index.vespa_constants import DOCUMENT_SETS
from onyx.document_index.vespa_constants import HIDDEN
from onyx.document_index.vespa_constants import METADATA_LIST
from onyx.document_index.vespa_constants import SOURCE_TYPE
from onyx.document_index.vespa_constants import TENANT_ID
from onyx.document_index.vespa_constants import USER_FILE
from onyx.document_index.vespa_constants import USER_FOLDER
from shared_configs.configs import MULTI_TENANT
# Import the function under test
class TestBuildVespaFilters:
def test_empty_filters(self) -> None:
"""Test with empty filters object."""
filters = IndexFilters(access_control_list=[])
result = build_vespa_filters(filters)
assert result == f"!({HIDDEN}=true) and "
# With trailing AND removed
result = build_vespa_filters(filters, remove_trailing_and=True)
assert result == f"!({HIDDEN}=true)"
def test_include_hidden(self) -> None:
"""Test with include_hidden flag."""
filters = IndexFilters(access_control_list=[])
result = build_vespa_filters(filters, include_hidden=True)
assert result == "" # No filters applied when including hidden
# With some other filter to ensure proper AND chaining
filters = IndexFilters(access_control_list=[], source_type=[DocumentSource.WEB])
result = build_vespa_filters(filters, include_hidden=True)
assert result == f'({SOURCE_TYPE} contains "web") and '
def test_acl(self) -> None:
"""Test with acls."""
# Single ACL
filters = IndexFilters(access_control_list=["user1"])
result = build_vespa_filters(filters)
assert (
result
== f'!({HIDDEN}=true) and (access_control_list contains "user1") and '
)
# Multiple ACLs
filters = IndexFilters(access_control_list=["user2", "group2"])
result = build_vespa_filters(filters)
assert (
result
== f'!({HIDDEN}=true) and (access_control_list contains "user2" or access_control_list contains "group2") and '
)
def test_tenant_filter(self) -> None:
"""Test tenant ID filtering."""
# With tenant ID
if MULTI_TENANT:
filters = IndexFilters(access_control_list=[], tenant_id="tenant1")
result = build_vespa_filters(filters)
assert (
f'!({HIDDEN}=true) and ({TENANT_ID} contains "tenant1") and ' == result
)
# No tenant ID
filters = IndexFilters(access_control_list=[], tenant_id=None)
result = build_vespa_filters(filters)
assert f"!({HIDDEN}=true) and " == result
def test_source_type_filter(self) -> None:
"""Test source type filtering."""
# Single source type
filters = IndexFilters(access_control_list=[], source_type=[DocumentSource.WEB])
result = build_vespa_filters(filters)
assert f'!({HIDDEN}=true) and ({SOURCE_TYPE} contains "web") and ' == result
# Multiple source types
filters = IndexFilters(
access_control_list=[],
source_type=[DocumentSource.WEB, DocumentSource.JIRA],
)
result = build_vespa_filters(filters)
assert (
f'!({HIDDEN}=true) and ({SOURCE_TYPE} contains "web" or {SOURCE_TYPE} contains "jira") and '
== result
)
# Empty source type list
filters = IndexFilters(access_control_list=[], source_type=[])
result = build_vespa_filters(filters)
assert f"!({HIDDEN}=true) and " == result
def test_tag_filters(self) -> None:
"""Test tag filtering."""
# Single tag
filters = IndexFilters(
access_control_list=[], tags=[Tag(tag_key="color", tag_value="red")]
)
result = build_vespa_filters(filters)
assert (
f'!({HIDDEN}=true) and ({METADATA_LIST} contains "color{INDEX_SEPARATOR}red") and '
== result
)
# Multiple tags
filters = IndexFilters(
access_control_list=[],
tags=[
Tag(tag_key="color", tag_value="red"),
Tag(tag_key="size", tag_value="large"),
],
)
result = build_vespa_filters(filters)
expected = (
f'!({HIDDEN}=true) and ({METADATA_LIST} contains "color{INDEX_SEPARATOR}red" '
f'or {METADATA_LIST} contains "size{INDEX_SEPARATOR}large") and '
)
assert expected == result
# Empty tags list
filters = IndexFilters(access_control_list=[], tags=[])
result = build_vespa_filters(filters)
assert f"!({HIDDEN}=true) and " == result
def test_document_sets_filter(self) -> None:
"""Test document sets filtering."""
# Single document set
filters = IndexFilters(access_control_list=[], document_set=["set1"])
result = build_vespa_filters(filters)
assert f'!({HIDDEN}=true) and ({DOCUMENT_SETS} contains "set1") and ' == result
# Multiple document sets
filters = IndexFilters(access_control_list=[], document_set=["set1", "set2"])
result = build_vespa_filters(filters)
assert (
f'!({HIDDEN}=true) and ({DOCUMENT_SETS} contains "set1" or {DOCUMENT_SETS} contains "set2") and '
== result
)
# Empty document sets
filters = IndexFilters(access_control_list=[], document_set=[])
result = build_vespa_filters(filters)
assert f"!({HIDDEN}=true) and " == result
    def test_user_file_ids_filter(self) -> None:
        """Test user file IDs filtering."""
        # Single user file ID
        filters = IndexFilters(access_control_list=[], user_file_ids=[123])
        result = build_vespa_filters(filters)
        assert f"!({HIDDEN}=true) and ({USER_FILE} = 123) and " == result

        # Multiple user file IDs
        filters = IndexFilters(access_control_list=[], user_file_ids=[123, 456])
        result = build_vespa_filters(filters)
        assert (
            f"!({HIDDEN}=true) and ({USER_FILE} = 123 or {USER_FILE} = 456) and "
            == result
        )

        # Empty user file IDs
        filters = IndexFilters(access_control_list=[], user_file_ids=[])
        result = build_vespa_filters(filters)
        assert f"!({HIDDEN}=true) and " == result

    def test_user_folder_ids_filter(self) -> None:
        """Test user folder IDs filtering."""
        # Single user folder ID
        filters = IndexFilters(access_control_list=[], user_folder_ids=[789])
        result = build_vespa_filters(filters)
        assert f"!({HIDDEN}=true) and ({USER_FOLDER} = 789) and " == result

        # Multiple user folder IDs
        filters = IndexFilters(access_control_list=[], user_folder_ids=[789, 101])
        result = build_vespa_filters(filters)
        assert (
            f"!({HIDDEN}=true) and ({USER_FOLDER} = 789 or {USER_FOLDER} = 101) and "
            == result
        )

        # Empty user folder IDs
        filters = IndexFilters(access_control_list=[], user_folder_ids=[])
        result = build_vespa_filters(filters)
        assert f"!({HIDDEN}=true) and " == result

    def test_time_cutoff_filter(self) -> None:
        """Test time cutoff filtering."""
        # With cutoff time
        cutoff_time = datetime(2023, 1, 1, tzinfo=timezone.utc)
        filters = IndexFilters(access_control_list=[], time_cutoff=cutoff_time)
        result = build_vespa_filters(filters)
        cutoff_secs = int(cutoff_time.timestamp())
        assert (
            f"!({HIDDEN}=true) and !({DOC_UPDATED_AT} < {cutoff_secs}) and " == result
        )

        # No cutoff time
        filters = IndexFilters(access_control_list=[], time_cutoff=None)
        result = build_vespa_filters(filters)
        assert f"!({HIDDEN}=true) and " == result

        # Test untimed logic (when cutoff is old enough)
        old_cutoff = datetime.now(timezone.utc) - timedelta(days=100)
        filters = IndexFilters(access_control_list=[], time_cutoff=old_cutoff)
        result = build_vespa_filters(filters)
        old_cutoff_secs = int(old_cutoff.timestamp())
        assert (
            f"!({HIDDEN}=true) and !({DOC_UPDATED_AT} < {old_cutoff_secs}) and "
            == result
        )

    def test_combined_filters(self) -> None:
        """Test combining multiple filter types."""
        filters = IndexFilters(
            access_control_list=["user1", "group1"],
            source_type=[DocumentSource.WEB],
            tags=[Tag(tag_key="color", tag_value="red")],
            document_set=["set1"],
            user_file_ids=[123],
            user_folder_ids=[789],
            time_cutoff=datetime(2023, 1, 1, tzinfo=timezone.utc),
        )
        result = build_vespa_filters(filters)

        # Build expected result piece by piece for readability
        expected = f"!({HIDDEN}=true) and "
        expected += (
            '(access_control_list contains "user1" or '
            'access_control_list contains "group1") and '
        )
        expected += f'({SOURCE_TYPE} contains "web") and '
        expected += f'({METADATA_LIST} contains "color{INDEX_SEPARATOR}red") and '
        expected += f'({DOCUMENT_SETS} contains "set1") and '
        expected += f"({USER_FILE} = 123) and "
        expected += f"({USER_FOLDER} = 789) and "
        cutoff_secs = int(datetime(2023, 1, 1, tzinfo=timezone.utc).timestamp())
        expected += f"!({DOC_UPDATED_AT} < {cutoff_secs}) and "
        assert expected == result

        # With trailing AND removed
        result_no_trailing = build_vespa_filters(filters, remove_trailing_and=True)
        assert expected[:-5] == result_no_trailing  # Remove trailing " and "

    def test_empty_or_none_values(self) -> None:
        """Test with empty or None values in filter lists."""
        # Empty strings in document set
        filters = IndexFilters(
            access_control_list=[], document_set=["set1", "", "set2"]
        )
        result = build_vespa_filters(filters)
        assert (
            f'!({HIDDEN}=true) and ({DOCUMENT_SETS} contains "set1" or {DOCUMENT_SETS} contains "set2") and '
            == result
        )

        # All empty strings in document set
        filters = IndexFilters(access_control_list=[], document_set=["", ""])
        result = build_vespa_filters(filters)
        assert f"!({HIDDEN}=true) and " == result
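For readers skimming the diff, the assertions above imply that build_vespa_filters simply joins one clause per active filter with " and " and leaves a trailing " and " unless remove_trailing_and is set. Below is a minimal, hypothetical sketch of that pattern covering only the hidden, user-file, user-folder, and time-cutoff clauses exercised above; the constant values and the trimmed-down IndexFilters are illustrative stand-ins, not the actual Onyx implementation.

from dataclasses import dataclass
from datetime import datetime

# Placeholder field names; the real schema constants live alongside the Vespa config.
HIDDEN = "hidden"
USER_FILE = "user_file"
USER_FOLDER = "user_folder"
DOC_UPDATED_AT = "doc_updated_at"


@dataclass
class IndexFilters:
    access_control_list: list[str]
    user_file_ids: list[int] | None = None
    user_folder_ids: list[int] | None = None
    time_cutoff: datetime | None = None


def build_vespa_filters(filters: IndexFilters, remove_trailing_and: bool = False) -> str:
    # Every query starts by excluding hidden documents.
    clauses = [f"!({HIDDEN}=true)"]

    # ACL, source, tag, and document-set clauses are omitted for brevity; the
    # combined test above shows them rendered as OR-joined "contains" checks.

    # Integer ID filters become OR-joined equality checks inside parentheses.
    if filters.user_file_ids:
        joined = " or ".join(f"{USER_FILE} = {i}" for i in filters.user_file_ids)
        clauses.append(f"({joined})")
    if filters.user_folder_ids:
        joined = " or ".join(f"{USER_FOLDER} = {i}" for i in filters.user_folder_ids)
        clauses.append(f"({joined})")

    # A time cutoff excludes documents updated before the cutoff (epoch seconds).
    if filters.time_cutoff:
        cutoff_secs = int(filters.time_cutoff.timestamp())
        clauses.append(f"!({DOC_UPDATED_AT} < {cutoff_secs})")

    result = " and ".join(clauses) + " and "
    # The trailing " and " lets callers append further YQL; it is five characters
    # long, which is why the test strips it with expected[:-5].
    return result[:-5] if remove_trailing_and else result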

529
company_links.csv Normal file

@@ -0,0 +1,529 @@
Company,Link
1849-bio,https://x.com/1849bio
1stcollab,https://twitter.com/ycombinator
abundant,https://x.com/abundant_labs
activepieces,https://mobile.twitter.com/mabuaboud
acx,https://twitter.com/ycombinator
adri-ai,https://twitter.com/darshitac_
affil-ai,https://twitter.com/ycombinator
agave,https://twitter.com/moyicat
aglide,https://twitter.com/pdmcguckian
ai-2,https://twitter.com/the_yuppy
ai-sell,https://x.com/liuzjerry
airtrain-ai,https://twitter.com/neutralino1
aisdr,https://twitter.com/YuriyZaremba
alex,https://x.com/DanielEdrisian
alga-biosciences,https://twitter.com/algabiosciences
alguna,https://twitter.com/aleks_djekic
alixia,https://twitter.com/ycombinator
aminoanalytica,https://x.com/lilwuuzivert
anara,https://twitter.com/naveedjanmo
andi,https://twitter.com/MiamiAngela
andoria,https://x.com/dbudimane
andromeda-surgical,https://twitter.com/nickdamian0
anglera,https://twitter.com/ycombinator
angstrom-ai,https://twitter.com/JaviAC7
ankr-health,https://twitter.com/Ankr_us
apoxy,https://twitter.com/ycombinator
apten,https://twitter.com/dho1357
aragorn-ai,https://twitter.com/ycombinator
arc-2,https://twitter.com/DarkMirage
archilabs,https://twitter.com/ycombinator
arcimus,https://twitter.com/husseinsyed73
argovox,https://www.argovox.com/
artemis-search,https://twitter.com/ycombinator
artie,https://x.com/JacquelineSYC19
asklio,https://twitter.com/butterflock
atlas-2,https://twitter.com/jobryan
attain,https://twitter.com/aamir_hudda
autocomputer,https://twitter.com/madhavsinghal_
automat,https://twitter.com/lucas0choa
automorphic,https://twitter.com/sandkoan
autopallet-robotics,https://twitter.com/ycombinator
autumn-labs,https://twitter.com/ycombinator
aviary,https://twitter.com/ycombinator
azuki,https://twitter.com/VamptVo
banabo,https://twitter.com/ycombinator
baseline-ai,https://twitter.com/ycombinator
baserun,https://twitter.com/effyyzhang
benchify,https://www.x.com/maxvonhippel
berry,https://twitter.com/annchanyt
bifrost,https://twitter.com/0xMysterious
bifrost-orbital,https://x.com/ionkarbatra
biggerpicture,https://twitter.com/ycombinator
biocartesian,https://twitter.com/ycombinator
bland-ai,https://twitter.com/zaygranet
blast,https://x.com/useblast
blaze,https://twitter.com/larfy_rothwell
bluebirds,https://twitter.com/RohanPunamia
bluedot,https://twitter.com/selinayfilizp
bluehill-payments,https://twitter.com/HimanshuMinocha
blyss,https://twitter.com/blyssdev
bolto,https://twitter.com/mrinalsingh02?lang=en
botcity,https://twitter.com/lorhancaproni
boundo,https://twitter.com/ycombinator
bramble,https://x.com/meksikanpijha
bricksai,https://twitter.com/ycombinator
broccoli-ai,https://twitter.com/abhishekjain25
bronco-ai,https://twitter.com/dluozhang
bunting-labs,https://twitter.com/normconstant
byterat,https://twitter.com/penelopekjones_
callback,https://twitter.com/ycombinator
cambio-2,https://twitter.com/ycombinator
camfer,https://x.com/AryaBastani
campfire-2,https://twitter.com/ycombinator
campfire-applied-ai-company,https://twitter.com/siamakfr
candid,https://x.com/kesavkosana
canvas,https://x.com/essamsleiman
capsule,https://twitter.com/kelsey_pedersen
cardinal,http://twitter.com/nadavwiz
cardinal-gray,https://twitter.com/ycombinator
cargo,https://twitter.com/aureeaubert
cartage,https://twitter.com/ycombinator
cashmere,https://twitter.com/shashankbuilds
cedalio,https://twitter.com/LucianaReznik
cekura-2,https://x.com/tarush_agarwal_
central,https://twitter.com/nilaymod
champ,https://twitter.com/ycombinator
cheers,https://twitter.com/ycombinator
chequpi,https://twitter.com/sudshekhar02
chima,https://twitter.com/nikharanirghin
cinapse,https://www.twitter.com/hgphillipsiv
ciro,https://twitter.com/davidjwiner
clara,https://x.com/levinsonjon
cleancard,https://twitter.com/_tom_dot_com
clearspace,https://twitter.com/rbfasho
cobbery,https://twitter.com/Dan_The_Goodman
codeviz,https://x.com/liam_prev
coil-inc,https://twitter.com/ycombinator
coldreach,https://twitter.com/ycombinator
combinehealth,https://twitter.com/ycombinator
comfy-deploy,https://twitter.com/nicholaskkao
complete,https://twitter.com/ranimavram
conductor-quantum,https://twitter.com/BrandonSeverin
conduit,https://twitter.com/ycombinator
continue,https://twitter.com/tylerjdunn
contour,https://twitter.com/ycombinator
coperniq,https://twitter.com/abdullahzandani
corgea,https://twitter.com/asadeddin
corgi,https://twitter.com/nico_laqua?lang=en
corgi-labs,https://twitter.com/ycombinator
coris,https://twitter.com/psvinodh
cosine,https://twitter.com/AlistairPullen
courtyard-io,https://twitter.com/lejeunedall
coverage-cat,https://twitter.com/coveragecats
craftos,https://twitter.com/wa3l
craniometrix,https://craniometrix.com
ctgt,https://twitter.com/cyrilgorlla
curo,https://x.com/EnergizedAndrew
dagworks-inc,https://twitter.com/dagworks
dart,https://twitter.com/milad3malek
dashdive,https://twitter.com/micahawheat
dataleap,https://twitter.com/jh_damm
decisional-ai,https://x.com/groovetandon
decoda-health,https://twitter.com/ycombinator
deepsilicon,https://x.com/abhireddy2004
delfino-ai,https://twitter.com/ycombinator
demo-gorilla,https://twitter.com/ycombinator
demospace,https://www.twitter.com/nick_fiacco
dench-com,https://www.twitter.com/markrachapoom
denormalized,https://twitter.com/IAmMattGreen
dev-tools-ai,https://twitter.com/ycombinator
diffusion-studio,https://x.com/MatthiasRuiz22
digitalcarbon,https://x.com/CtrlGuruDelete
dimely,https://x.com/UseDimely
disputeninja,https://twitter.com/legitmaxwu
diversion,https://twitter.com/sasham1
dmodel,https://twitter.com/dmooooon
doctor-droid,https://twitter.com/TheBengaluruGuy
dodo,https://x.com/dominik_moehrle
dojah-inc,https://twitter.com/ololaday
domu-technology-inc,https://twitter.com/ycombinator
dr-treat,https://twitter.com/rakeshtondon
dreamrp,https://x.com/dreamrpofficial
drivingforce,https://twitter.com/drivingforcehq
dynamo-ai,https://twitter.com/dynamo_fl
edgebit,https://twitter.com/robszumski
educato-ai,https://x.com/FelixGabler
electric-air-2,https://twitter.com/JezOsborne
ember,https://twitter.com/hsinleiwang
ember-robotics,https://twitter.com/ycombinator
emergent,https://twitter.com/mukundjha
emobi,https://twitter.com/ycombinator
entangl,https://twitter.com/Shapol_m
envelope,https://twitter.com/joshuakcockrell
et-al,https://twitter.com/ycombinator
eugit-therapeutics,http://www.eugittx.com
eventual,https://twitter.com/sammy_sidhu
evoly,https://twitter.com/ycombinator
expand-ai,https://twitter.com/timsuchanek
ezdubs,https://twitter.com/PadmanabhanKri
fabius,https://twitter.com/adayNU
fazeshift,https://twitter.com/ycombinator
felafax,https://twitter.com/ThatNithin
fetchr,https://twitter.com/CalvinnChenn
fiber-ai,https://twitter.com/AdiAgashe
ficra,https://x.com/ficra_ai
fiddlecube,https://twitter.com/nupoor_neha
finic,https://twitter.com/jfan001
finta,https://www.twitter.com/andywang
fintool,https://twitter.com/nicbstme
finvest,https://twitter.com/shivambharuka
firecrawl,https://x.com/ericciarla
firstwork,https://twitter.com/techie_Shubham
fixa,https://x.com/jonathanzliu
flair-health,https://twitter.com/adivawhocodes
fleek,https://twitter.com/ycombinator
fleetworks,https://twitter.com/ycombinator
flike,https://twitter.com/yajmch
flint-2,https://twitter.com/hungrysohan
floworks,https://twitter.com/sarthaks92
focus-buddy,https://twitter.com/yash14700/
forerunner-ai,https://x.com/willnida0
founders,https://twitter.com/ycombinator
foundry,https://x.com/FoundryAI_
freestyle,https://x.com/benswerd
fresco,https://twitter.com/ycombinator
friday,https://x.com/AllenNaliath
frigade,https://twitter.com/FrigadeHQ
futureclinic,https://twitter.com/usamasyedmd
gait,https://twitter.com/AlexYHsia
galini,https://twitter.com/ycombinator
gauge,https://twitter.com/the1024th
gecko-security,https://x.com/jjjutla
general-analysis,https://twitter.com/ycombinator
giga-ml,https://twitter.com/varunvummadi
glade,https://twitter.com/ycombinator
glass-health,https://twitter.com/dereckwpaul
goodfin,https://twitter.com/ycombinator
grai,https://twitter.com/ycombinator
greenlite,https://twitter.com/will_lawrenceTO
grey,https://www.twitter.com/kingidee
happyrobot,https://twitter.com/pablorpalafox
haystack-software,https://x.com/AkshaySubr42403
health-harbor,https://twitter.com/AlanLiu96
healthspark,https://twitter.com/stephengrinich
hedgehog-2,https://twitter.com/ycombinator
helicone,https://twitter.com/justinstorre
heroui,https://x.com/jrgarciadev
hoai,https://twitter.com/ycombinator
hockeystack,https://twitter.com/ycombinator
hokali,https://twitter.com/hokalico
homeflow,https://twitter.com/ycombinator
hubble-network,https://twitter.com/BenWild10
humand,https://twitter.com/nicolasbenenzon
humanlayer,https://twitter.com/dexhorthy
hydra,https://twitter.com/JoeSciarrino
hyperbound,https://twitter.com/sguduguntla
ideate-xyz,https://twitter.com/nomocodes
inbuild,https://twitter.com/TySharp_iB
indexical,https://twitter.com/try_nebula
industrial-next,https://twitter.com/ycombinator
infisical,https://twitter.com/matsiiako
inkeep,https://twitter.com/nickgomezc
inlet-2,https://twitter.com/inlet_ai
innkeeper,https://twitter.com/tejasybhakta
instant,https://twitter.com/JoeAverbukh
integrated-reasoning,https://twitter.com/d4r5c2
interlock,https://twitter.com/ycombinator
intryc,https://x.com/alexmarantelos?lang=en
invert,https://twitter.com/purrmin
iollo,https://twitter.com/daniel_gomari
jamble,https://twitter.com/ycombinator
joon-health,https://twitter.com/IsaacVanEaves
juicebox,https://twitter.com/davepaffenholz
julius,https://twitter.com/0interestrates
karmen,https://twitter.com/ycombinator
kenley,https://x.com/KenleyAI
keylika,https://twitter.com/buddhachaudhuri
khoj,https://twitter.com/debanjum
kite,https://twitter.com/DerekFeehrer
kivo-health,https://twitter.com/vaughnkoch
knowtex,https://twitter.com/CarolineCZhang
koala,https://twitter.com/studioseinstein?s=11
kopra-bio,https://x.com/AF_Haddad
kura,https://x.com/kura_labs
laminar,https://twitter.com/skull8888888888
lancedb,https://twitter.com/changhiskhan
latent,https://twitter.com/ycombinator
layerup,https://twitter.com/arnavbathla20
lazyeditor,https://twitter.com/jee_cash
ledgerup,https://twitter.com/josephrjohnson
lifelike,https://twitter.com/alecxiang1
lighthouz-ai,https://x.com/srijankedia
lightski,https://www.twitter.com/hansenq
ligo-biosciences,https://x.com/ArdaGoreci/status/1830744265007480934
line-build,https://twitter.com/ycombinator
lingodotdev,https://twitter.com/maxprilutskiy
linkgrep,https://twitter.com/linkgrep
linum,https://twitter.com/schopra909
livedocs,https://twitter.com/arsalanbashir
luca,https://twitter.com/LucaPricingHq
lumenary,https://twitter.com/vivekhaz
lune,https://x.com/samuelp4rk
lynx,https://twitter.com/ycombinator
magic-loops,https://twitter.com/jumploops
manaflow,https://twitter.com/austinywang
mandel-ai,https://twitter.com/shmkkr
martin,https://twitter.com/martinvoiceai
matano,https://twitter.com/AhmedSamrose
mdhub,https://twitter.com/ealamolda
mederva-health,http://twitter.com/sabihmir
medplum,https://twitter.com/ReshmaKhilnani
melty,https://x.com/charliebholtz
mem0,https://twitter.com/taranjeetio
mercator,https://www.twitter.com/ajdstein
mercoa,https://twitter.com/Sarora27
meru,https://twitter.com/rohanarora_
metalware,https://twitter.com/ryanchowww
metriport,https://twitter.com/dimagoncharov_
mica-ai,https://twitter.com/ycombinator
middleware,https://twitter.com/laduramvishnoi
midship,https://twitter.com/_kietay
mintlify,https://twitter.com/hanwangio
minusx,https://twitter.com/nuwandavek
miracle,https://twitter.com/ycombinator
miru-ml,https://twitter.com/armelwtalla
mito-health,https://twitter.com/teemingchew
mocha,https://twitter.com/nichochar
modern-realty,https://x.com/RIsanians
modulari-t,https://twitter.com/ycombinator
mogara,https://twitter.com/ycombinator
monterey-ai,https://twitter.com/chunonline
moonglow,https://twitter.com/leilavclark
moonshine,https://x.com/useMoonshine
moreta,https://twitter.com/ycombinator
mutable-ai,https://x.com/smahsramo
myria,https://twitter.com/reyflemings
nango,https://twitter.com/rguldener
nanograb,https://twitter.com/lauhoyeung
nara,https://twitter.com/join_nara
narrative,https://twitter.com/axitkhurana
nectar,https://twitter.com/AllenWang314
neosync,https://twitter.com/evisdrenova
nerve,https://x.com/fortress_build
networkocean,https://twitter.com/sammendel4
ngrow-ai,https://twitter.com/ycombinator
no-cap,https://x.com/nocapso
nowadays,https://twitter.com/ycombinator
numeral,https://www.twitter.com/mduvall_
obento-health,https://twitter.com/ycombinator
octopipe,https://twitter.com/abhishekray07
odo,https://twitter.com/ycombinator
ofone,https://twitter.com/ycombinator
onetext,http://twitter.com/jfudem
openfunnel,https://x.com/fenilsuchak
opensight,https://twitter.com/OpenSightAI
ora-ai,https://twitter.com/ryan_rl_phelps
orchid,https://twitter.com/ycombinator
origami-agents,https://x.com/fin465
outerbase,https://www.twitter.com/burcs
outerport,https://x.com/yongyuanxi
outset,https://twitter.com/AaronLCannon
overeasy,https://twitter.com/skyflylu
overlap,https://x.com/jbaerofficial
oway,https://twitter.com/owayinc
ozone,https://twitter.com/maxvwolff
pair-ai,https://twitter.com/ycombinator
palmier,https://twitter.com/ycombinator
panora,https://twitter.com/rflih_
parabolic,https://twitter.com/ycombinator
paragon-ai,https://twitter.com/ycombinator
parahelp,https://twitter.com/ankerbachryhl
parity,https://x.com/wilson_spearman
parley,https://twitter.com/ycombinator
patched,https://x.com/rohan_sood15
pearson-labs,https://twitter.com/ycombinator
pelm,https://twitter.com/ycombinator
penguin-ai,https://twitter.com/ycombinator
peoplebox,https://twitter.com/abhichugh
permitflow,https://twitter.com/ycombinator
permitportal,https://twitter.com/rgmazilu
persana-ai,https://www.twitter.com/tweetsreez
pharos,https://x.com/felix_brann
phind,https://twitter.com/michaelroyzen
phonely,https://x.com/phonely_ai
pier,https://twitter.com/ycombinator
pierre,https://twitter.com/fat
pinnacle,https://twitter.com/SeanRoades
pipeshift,https://x.com/FerraoEnrique
pivot,https://twitter.com/raimietang
planbase,https://twitter.com/ycombinator
plover-parametrics,https://twitter.com/ycombinator
plutis,https://twitter.com/kamil_m_ali
poka-labs,https://twitter.com/ycombinator
poly,https://twitter.com/Denizen_Kane
polymath-robotics,https://twitter.com/stefanesa
ponyrun,https://twitter.com/ycombinator
poplarml,https://twitter.com/dnaliu17
posh,https://twitter.com/PoshElectric
power-to-the-brand,https://twitter.com/ycombinator
primevault,https://twitter.com/prashantupd
prohostai,https://twitter.com/bilguunu
promptloop,https://twitter.com/PeterbMangan
propaya,https://x.com/PropayaOfficial
proper,https://twitter.com/kylemaloney_
proprise,https://twitter.com/kragerDev
protegee,https://x.com/kirthibanothu
pump-co,https://www.twitter.com/spndn07/
pumpkin,https://twitter.com/SamuelCrombie
pure,https://twitter.com/collectpure
pylon-2,https://x.com/marty_kausas
pyq-ai,https://twitter.com/araghuvanshi2
query-vary,https://twitter.com/DJFinetunes
rankai,https://x.com/rankai_ai
rastro,https://twitter.com/baptiste_cumin
reactwise,https://twitter.com/ycombinator
read-bean,https://twitter.com/maggieqzhang
readily,https://twitter.com/ycombinator
redouble-ai,https://twitter.com/pneumaticdill?s=21
refine,https://twitter.com/civanozseyhan
reflex,https://twitter.com/getreflex
reforged-labs,https://twitter.com/ycombinator
relace,https://twitter.com/ycombinator
relate,https://twitter.com/chrischae__
remade,https://x.com/Christos_antono
remy,https://twitter.com/ycombinator
remy-2,https://x.com/remysearch
rentflow,https://twitter.com/ycombinator
requestly,https://twitter.com/sachinjain024
resend,https://x.com/zenorocha
respaid,https://twitter.com/johnbanr
reticular,https://x.com/nithinparsan
retrofix-ai,https://twitter.com/danieldoesdev
revamp,https://twitter.com/getrevamp_ai
revyl,https://x.com/landseerenga
reworkd,https://twitter.com/asimdotshrestha
reworks,https://twitter.com/ycombinator
rift,https://twitter.com/FilipTwarowski
riskangle,https://twitter.com/ycombinator
riskcube,https://x.com/andrei_risk
rivet,https://twitter.com/nicholaskissel
riveter-ai,https://x.com/AGrillz
roame,https://x.com/timtqin
roforco,https://x.com/brain_xiang
rome,https://twitter.com/craigzLiszt
roomplays,https://twitter.com/criyaco
rosebud-biosciences,https://twitter.com/KitchenerWilson
rowboat-labs,https://twitter.com/segmenta
rubber-ducky-labs,https://twitter.com/alexandraj777
ruleset,https://twitter.com/LoganFrederick
ryvn,https://x.com/ryvnai
safetykit,https://twitter.com/ycombinator
sage-ai,https://twitter.com/akhilmurthy20
saldor,https://x.com/notblandjacob
salient,https://twitter.com/ycombinator
schemeflow,https://x.com/browninghere
sculpt,https://twitter.com/ycombinator
seals-ai,https://x.com/luismariogm
seis,https://twitter.com/TrevMcKendrick
sensei,https://twitter.com/ycombinator
sensorsurf,https://twitter.com/noahjepstein
sepal-ai,https://www.twitter.com/katqhu1
serial,https://twitter.com/Serialmfg
serif-health,https://www.twitter.com/mfrobben
serra,https://twitter.com/ycombinator
shasta-health,https://twitter.com/SrinjoyMajumdar
shekel-mobility,https://twitter.com/ShekelMobility
shortbread,https://twitter.com/ShortbreadAI
showandtell,https://twitter.com/ycombinator
sidenote,https://twitter.com/jclin22009
sieve,https://twitter.com/mokshith_v
silkchart,https://twitter.com/afakerele
simple-ai,https://twitter.com/catheryn_li
simplehash,https://twitter.com/Alex_Kilkka
simplex,https://x.com/simplexdata
simplifine,https://x.com/egekduman
sizeless,https://twitter.com/cornelius_einem
skyvern,https://x.com/itssuchintan
slingshot,https://twitter.com/ycombinator
snowpilot,https://x.com/snowpilotai
soff,https://x.com/BernhardHausle1
solum-health,https://twitter.com/ycombinator
sonnet,https://twitter.com/ycombinator
sophys,https://twitter.com/ycombinator
sorcerer,https://x.com/big_veech
soteri-skin,https://twitter.com/SoteriSkin
sphere,https://twitter.com/nrudder_
spine-ai,https://twitter.com/BudhkarAkshay
spongecake,https://twitter.com/ycombinator
spur,https://twitter.com/sneha8sivakumar
sre-ai,https://twitter.com/ycombinator
stably,https://x.com/JinjingLiang
stack-ai,https://twitter.com/bernaceituno
stellar,https://twitter.com/ycombinator
stormy-ai-autonomous-marketing-agent,https://twitter.com/karmedge/
strada,https://twitter.com/AmirProd1
stream,https://twitter.com/ycombinator
structured-labs,https://twitter.com/amruthagujjar
studdy,https://twitter.com/mike_lamma
subscriptionflow,https://twitter.com/KashifSaleemCEO
subsets,https://twitter.com/ycombinator
supercontrast,https://twitter.com/ycombinator
supertone,https://twitter.com/trysupertone
superunit,https://x.com/peter_marler
sweep,https://twitter.com/wwzeng1
syncly,https://x.com/synclyhq
synnax,https://x.com/Emilbon99
syntheticfi,https://x.com/SyntheticFi_SF
t3-chat-prev-ping-gg,https://twitter.com/t3dotgg
tableflow,https://twitter.com/mitchpatin
tai,https://twitter.com/Tragen_ai
tandem-2,https://x.com/Tandemspace
taxgpt,https://twitter.com/ChKashifAli
taylor-ai,https://twitter.com/brian_j_kim
teamout,https://twitter.com/ycombinator
tegon,https://twitter.com/harshithb4h
terminal,https://x.com/withterminal
theneo,https://twitter.com/robakid
theya,https://twitter.com/vikasch
thyme,https://twitter.com/ycombinator
tiny,https://twitter.com/ycombinator
tola,https://twitter.com/alencvisic
trainy,https://twitter.com/TrainyAI
trendex-we-tokenize-talent,https://twitter.com/ycombinator
trueplace,https://twitter.com/ycombinator
truewind,https://twitter.com/AlexLee611
trusty,https://twitter.com/trustyhomes
truva,https://twitter.com/gaurav_aggarwal
tuesday,https://twitter.com/kai_jiabo_feng
twenty,https://twitter.com/twentycrm
twine,https://twitter.com/anandvalavalkar
two-dots,https://twitter.com/HensonOrser1
typa,https://twitter.com/sounhochung
typeless,https://twitter.com/ycombinator
unbound,https://twitter.com/ycombinator
undermind,https://twitter.com/UndermindAI
unison,https://twitter.com/maxim_xyz
unlayer,https://twitter.com/adeelraza
unstatiq,https://twitter.com/NishSingaraju
unusual,https://x.com/willwjack
upfront,https://twitter.com/KnowUpfront
vaero,https://twitter.com/ycombinator
vango-ai,https://twitter.com/vango_ai
variance,https://twitter.com/karinemellata
variant,https://twitter.com/bnj
velos,https://twitter.com/OscarMHBF
velt,https://twitter.com/rakesh_goyal
vendra,https://x.com/vendraHQ
vera-health,https://x.com/_maximall
verata,https://twitter.com/ycombinator
versive,https://twitter.com/getversive
vessel,https://twitter.com/vesselapi
vibe,https://twitter.com/ycombinator
videogen,https://twitter.com/ycombinator
vigilant,https://twitter.com/BenShumaker_
vitalize-care,https://twitter.com/nikhiljdsouza
viva-labs,https://twitter.com/vishal_the_jain
vizly,https://twitter.com/vizlyhq
vly-ai-2,https://x.com/victorxheng
vocode,https://twitter.com/kianhooshmand
void,https://x.com/parel_es
voltic,https://twitter.com/ycombinator
vooma,https://twitter.com/jessebucks
wingback,https://twitter.com/tfriehe_
winter,https://twitter.com/AzianMike
wolfia,https://twitter.com/narenmano
wordware,https://twitter.com/kozerafilip
zenbase-ai,https://twitter.com/CyrusOfEden
zeropath,https://x.com/zeropathAI

Some files were not shown because too many files have changed in this diff.