Mirror of https://github.com/onyx-dot-app/onyx.git (synced 2026-02-17 07:45:47 +00:00)

Compare commits (34 commits)

| SHA1 |
|---|
| 03dfa0fcc0 |
| 0acd50b75d |
| c3c9a0e57c |
| ef978aea97 |
| 15ab0586df |
| 839c8611b7 |
| 68f9f157a6 |
| 9dd56a5c80 |
| 842a73a242 |
| c04c1ea31b |
| 2380c2266c |
| b02af9b280 |
| 42938dcf62 |
| 93886f0e2c |
| 8c3a953b7a |
| 54b883d0ca |
| 91faac5447 |
| 1d8f9fc39d |
| 9390de21e5 |
| 3a33433fc9 |
| c4865d57b1 |
| 81d04db08f |
| d50a17db21 |
| dc5a1e8fd0 |
| c0b3681650 |
| 7ec04484d4 |
| 1cf966ecc1 |
| 8a8526dbbb |
| be20586ba1 |
| a314462d1e |
| 155f53c3d7 |
| 7c027df186 |
| 0a5db96026 |
| daef985b02 |
@@ -23,6 +23,10 @@ env:
  # Jira
  JIRA_USER_EMAIL: ${{ secrets.JIRA_USER_EMAIL }}
  JIRA_API_TOKEN: ${{ secrets.JIRA_API_TOKEN }}

  GONG_ACCESS_KEY: ${{ secrets.GONG_ACCESS_KEY }}
  GONG_ACCESS_KEY_SECRET: ${{ secrets.GONG_ACCESS_KEY_SECRET }}

  # Google
  GOOGLE_DRIVE_SERVICE_ACCOUNT_JSON_STR: ${{ secrets.GOOGLE_DRIVE_SERVICE_ACCOUNT_JSON_STR }}
  GOOGLE_DRIVE_OAUTH_CREDENTIALS_JSON_STR_TEST_USER_1: ${{ secrets.GOOGLE_DRIVE_OAUTH_CREDENTIALS_JSON_STR_TEST_USER_1 }}
README.md (64 lines changed)
@@ -30,30 +30,26 @@ Keep knowledge and access controls sync-ed across over 40 connectors like Google
Create custom AI agents with unique prompts, knowledge, and actions that the agents can take.
Onyx can be deployed securely anywhere and for any scale - on a laptop, on-premise, or to cloud.

<h3>Feature Highlights</h3>

**Deep research over your team's knowledge:**
https://private-user-images.githubusercontent.com/32520769/414509312-48392e83-95d0-4fb5-8650-a396e05e0a32.mp4?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk5Mjg2MzYsIm5iZiI6MTczOTkyODMzNiwicGF0aCI6Ii8zMjUyMDc2OS80MTQ1MDkzMTItNDgzOTJlODMtOTVkMC00ZmI1LTg2NTAtYTM5NmUwNWUwYTMyLm1wND9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE5VDAxMjUzNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWFhMzk5Njg2Y2Y5YjFmNDNiYTQ2YzM5ZTg5YWJiYTU2NWMyY2YwNmUyODE2NWUxMDRiMWQxZWJmODI4YTA0MTUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.a9D8A0sgKE9AoaoE-mfFbJ6_OKYeqaf7TZ4Han2JfW8

**Use Onyx as a secure AI Chat with any LLM:**


**Easily set up connectors to your apps:**


**Access Onyx where your team already works:**


## Deployment

**To try it out for free and get started in seconds, check out [Onyx Cloud](https://cloud.onyx.app/signup)**.

Onyx can also be run locally (even on a laptop) or deployed on a virtual machine with a single
@@ -62,23 +58,23 @@ Onyx can also be run locally (even on a laptop) or deployed on a virtual machine
We also have built-in support for high-availability/scalable deployment on Kubernetes.
References [here](https://github.com/onyx-dot-app/onyx/tree/main/deployment).
## 🔍 Other Notable Benefits of Onyx

- Custom deep learning models for indexing and inference time, only through Onyx + learning from user feedback.
- Flexible security features like SSO (OIDC/SAML/OAuth2), RBAC, encryption of credentials, etc.
- Knowledge curation features like document-sets, query history, usage analytics, etc.
- Scalable deployment options tested up to many tens of thousands of users and hundreds of millions of documents.
## 🚧 Roadmap

- New methods in information retrieval (StructRAG, LightGraphRAG, etc.)
- Personalized Search
- Organizational understanding and ability to locate and suggest experts from your team.
- Code Search
- SQL and other structured query languages
## 🔌 Connectors

Keep knowledge and access in sync across 40+ connectors:

- Google Drive
@@ -99,19 +95,65 @@ Keep knowledge and access up to sync across 40+ connectors:

See the full list [here](https://docs.onyx.app/connectors).
## 📚 Licensing

There are two editions of Onyx:

- Onyx Community Edition (CE) is available freely under the MIT Expat license. Simply follow the Deployment guide above.
- Onyx Enterprise Edition (EE) includes extra features that are primarily useful for larger organizations.
  For feature details, check out [our website](https://www.onyx.app/pricing).

To try the Onyx Enterprise Edition:

1. Check out [Onyx Cloud](https://cloud.onyx.app/signup).
2. For self-hosting the Enterprise Edition, contact us at [founders@onyx.app](mailto:founders@onyx.app) or book a call with us on our [Cal](https://cal.com/team/onyx/founders).

## 💡 Contributing

Looking to contribute? Please check out the [Contribution Guide](CONTRIBUTING.md) for more details.
# YC Company Twitter Scraper

A script that scrapes YC company pages and extracts Twitter/X.com links.

## Requirements

- Python 3.7+
- Playwright

## Installation

1. Install the required packages:

```
pip install -r requirements.txt
```

2. Install Playwright browsers:
```
playwright install
```

## Usage

Run the script with default settings:

```
python scrape_yc_twitter.py
```

This will scrape the YC companies from recent batches (W23, S23, S24, F24, S22, W22) and save the Twitter links to `twitter_links.txt`.

### Custom URL and Output

```
python scrape_yc_twitter.py --url "https://www.ycombinator.com/companies?batch=W24" --output "w24_twitter.txt"
```

## How it works

1. Navigates to the specified YC companies page
2. Scrolls down to load all company cards
3. Extracts links to individual company pages
4. Visits each company page and extracts Twitter/X.com links
5. Saves the results to a text file
YC_SCRAPER_README.md (45 lines, new file)
@@ -0,0 +1,45 @@
# YC Company Twitter Scraper

A script that scrapes YC company pages and extracts Twitter/X.com links.

## Requirements

- Python 3.7+
- Playwright

## Installation

1. Install the required packages:

```
pip install -r requirements.txt
```

2. Install Playwright browsers:
```
playwright install
```

## Usage

Run the script with default settings:

```
python scrape_yc_twitter.py
```

This will scrape the YC companies from recent batches (W23, S23, S24, F24, S22, W22) and save the Twitter links to `twitter_links.txt`.

### Custom URL and Output

```
python scrape_yc_twitter.py --url "https://www.ycombinator.com/companies?batch=W24" --output "w24_twitter.txt"
```

## How it works

1. Navigates to the specified YC companies page
2. Scrolls down to load all company cards
3. Extracts links to individual company pages
4. Visits each company page and extracts Twitter/X.com links
5. Saves the results to a text file
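
The scraper script itself (`scrape_yc_twitter.py`) is not included in this diff, so the following is only a rough sketch of the flow described above, written against Playwright's sync API. The CSS selector for company links, the scroll heuristic, and the default batch URL are assumptions for illustration; only the `--url`/`--output` flags and the `twitter_links.txt` default come from this README.

```
import argparse
import re

from playwright.sync_api import sync_playwright

# Assumed default for illustration; the real script scrapes several recent batches.
DEFAULT_URL = "https://www.ycombinator.com/companies?batch=W24"


def scrape(url: str, output: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)

        # Steps 1-2: scroll until the page height stops growing so all company cards load
        previous_height = 0
        while True:
            page.mouse.wheel(0, 10000)
            page.wait_for_timeout(1000)
            height = page.evaluate("document.body.scrollHeight")
            if height == previous_height:
                break
            previous_height = height

        # Step 3: collect links to individual company pages (selector is an assumption)
        company_links = {
            a.get_attribute("href")
            for a in page.query_selector_all("a[href*='/companies/']")
        }

        # Step 4: visit each company page and keep any Twitter/X.com links
        twitter_links: set[str] = set()
        for link in sorted(link for link in company_links if link):
            full_url = link if link.startswith("http") else f"https://www.ycombinator.com{link}"
            page.goto(full_url)
            for a in page.query_selector_all("a[href]"):
                href = a.get_attribute("href") or ""
                if re.search(r"https?://(www\.)?(twitter|x)\.com/", href):
                    twitter_links.add(href)

        browser.close()

    # Step 5: save the results to a text file
    with open(output, "w") as f:
        f.write("\n".join(sorted(twitter_links)) + "\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--url", default=DEFAULT_URL)
    parser.add_argument("--output", default="twitter_links.txt")
    args = parser.parse_args()
    scrape(args.url, args.output)
```

Invocation mirrors the real script, e.g. `python scrape_yc_twitter.py --url "https://www.ycombinator.com/companies?batch=W24" --output w24_twitter.txt`.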
@@ -46,6 +46,7 @@ WORKDIR /app

# Utils used by model server
COPY ./onyx/utils/logger.py /app/onyx/utils/logger.py
COPY ./onyx/utils/middleware.py /app/onyx/utils/middleware.py

# Place to fetch version information
COPY ./onyx/__init__.py /app/onyx/__init__.py
@@ -0,0 +1,50 @@
"""update prompt length

Revision ID: 4794bc13e484
Revises: f7505c5b0284
Create Date: 2025-04-02 11:26:36.180328

"""
from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision = "4794bc13e484"
down_revision = "f7505c5b0284"
branch_labels = None
depends_on = None


def upgrade() -> None:
    op.alter_column(
        "prompt",
        "system_prompt",
        existing_type=sa.TEXT(),
        type_=sa.String(length=5000000),
        existing_nullable=False,
    )
    op.alter_column(
        "prompt",
        "task_prompt",
        existing_type=sa.TEXT(),
        type_=sa.String(length=5000000),
        existing_nullable=False,
    )


def downgrade() -> None:
    op.alter_column(
        "prompt",
        "system_prompt",
        existing_type=sa.String(length=5000000),
        type_=sa.TEXT(),
        existing_nullable=False,
    )
    op.alter_column(
        "prompt",
        "task_prompt",
        existing_type=sa.String(length=5000000),
        type_=sa.TEXT(),
        existing_nullable=False,
    )
@@ -0,0 +1,50 @@
"""add prompt length limit

Revision ID: f71470ba9274
Revises: 6a804aeb4830
Create Date: 2025-04-01 15:07:14.977435

"""


# revision identifiers, used by Alembic.
revision = "f71470ba9274"
down_revision = "6a804aeb4830"
branch_labels = None
depends_on = None


def upgrade() -> None:
    # op.alter_column(
    #     "prompt",
    #     "system_prompt",
    #     existing_type=sa.TEXT(),
    #     type_=sa.String(length=8000),
    #     existing_nullable=False,
    # )
    # op.alter_column(
    #     "prompt",
    #     "task_prompt",
    #     existing_type=sa.TEXT(),
    #     type_=sa.String(length=8000),
    #     existing_nullable=False,
    # )
    pass


def downgrade() -> None:
    # op.alter_column(
    #     "prompt",
    #     "system_prompt",
    #     existing_type=sa.String(length=8000),
    #     type_=sa.TEXT(),
    #     existing_nullable=False,
    # )
    # op.alter_column(
    #     "prompt",
    #     "task_prompt",
    #     existing_type=sa.String(length=8000),
    #     type_=sa.TEXT(),
    #     existing_nullable=False,
    # )
    pass
@@ -0,0 +1,77 @@
"""updated constraints for ccpairs

Revision ID: f7505c5b0284
Revises: f71470ba9274
Create Date: 2025-04-01 17:50:42.504818

"""
from alembic import op


# revision identifiers, used by Alembic.
revision = "f7505c5b0284"
down_revision = "f71470ba9274"
branch_labels = None
depends_on = None


def upgrade() -> None:
    # 1) Drop the old foreign-key constraints
    op.drop_constraint(
        "document_by_connector_credential_pair_connector_id_fkey",
        "document_by_connector_credential_pair",
        type_="foreignkey",
    )
    op.drop_constraint(
        "document_by_connector_credential_pair_credential_id_fkey",
        "document_by_connector_credential_pair",
        type_="foreignkey",
    )

    # 2) Re-add them with ondelete='CASCADE'
    op.create_foreign_key(
        "document_by_connector_credential_pair_connector_id_fkey",
        source_table="document_by_connector_credential_pair",
        referent_table="connector",
        local_cols=["connector_id"],
        remote_cols=["id"],
        ondelete="CASCADE",
    )
    op.create_foreign_key(
        "document_by_connector_credential_pair_credential_id_fkey",
        source_table="document_by_connector_credential_pair",
        referent_table="credential",
        local_cols=["credential_id"],
        remote_cols=["id"],
        ondelete="CASCADE",
    )


def downgrade() -> None:
    # Reverse the changes for rollback
    op.drop_constraint(
        "document_by_connector_credential_pair_connector_id_fkey",
        "document_by_connector_credential_pair",
        type_="foreignkey",
    )
    op.drop_constraint(
        "document_by_connector_credential_pair_credential_id_fkey",
        "document_by_connector_credential_pair",
        type_="foreignkey",
    )

    # Recreate without CASCADE
    op.create_foreign_key(
        "document_by_connector_credential_pair_connector_id_fkey",
        "document_by_connector_credential_pair",
        "connector",
        ["connector_id"],
        ["id"],
    )
    op.create_foreign_key(
        "document_by_connector_credential_pair_credential_id_fkey",
        "document_by_connector_credential_pair",
        "credential",
        ["credential_id"],
        ["id"],
    )
@@ -159,6 +159,9 @@ def _get_space_permissions(

        # Stores the permissions for each space
        space_permissions_by_space_key[space_key] = space_permissions
        logger.info(
            f"Found space permissions for space '{space_key}': {space_permissions}"
        )

    return space_permissions_by_space_key
@@ -55,7 +55,7 @@ def _post_query_chunk_censoring(
        # if user is None, permissions are not enforced
        return chunks

    chunks_to_keep = []
    final_chunk_dict: dict[str, InferenceChunk] = {}
    chunks_to_process: dict[DocumentSource, list[InferenceChunk]] = {}

    sources_to_censor = _get_all_censoring_enabled_sources()
@@ -64,7 +64,7 @@ def _post_query_chunk_censoring(
        if chunk.source_type in sources_to_censor:
            chunks_to_process.setdefault(chunk.source_type, []).append(chunk)
        else:
            chunks_to_keep.append(chunk)
            final_chunk_dict[chunk.unique_id] = chunk

    # For each source, filter out the chunks using the permission
    # check function for that source
@@ -79,6 +79,16 @@ def _post_query_chunk_censoring(
                f" chunks for this source and continuing: {e}"
            )
            continue
        chunks_to_keep.extend(censored_chunks)

    return chunks_to_keep
        for censored_chunk in censored_chunks:
            final_chunk_dict[censored_chunk.unique_id] = censored_chunk

    # IMPORTANT: make sure to retain the same ordering as the original `chunks` passed in
    final_chunk_list: list[InferenceChunk] = []
    for chunk in chunks:
        # only if the chunk is in the final censored chunks, add it to the final list
        # if it is missing, that means it was intentionally left out
        if chunk.unique_id in final_chunk_dict:
            final_chunk_list.append(final_chunk_dict[chunk.unique_id])

    return final_chunk_list
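
The hunk above replaces the old `chunks_to_keep` list with a `final_chunk_dict` keyed by `unique_id`, then rebuilds the result by walking the original `chunks`, so the censored output keeps the original retrieval ordering. A minimal self-contained sketch of that pattern, using a stand-in `Chunk` type rather than Onyx's `InferenceChunk`:

```
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    # stand-in for InferenceChunk; only the fields needed for the pattern
    unique_id: str
    content: str


def censor_preserving_order(chunks: list[Chunk], allowed_ids: set[str]) -> list[Chunk]:
    # collect survivors keyed by unique_id (collection order does not matter)
    kept = {c.unique_id: c for c in chunks if c.unique_id in allowed_ids}
    # re-walk the original list so output ordering matches input ordering;
    # anything missing from `kept` was intentionally censored out
    return [kept[c.unique_id] for c in chunks if c.unique_id in kept]


chunks = [Chunk("a", "..."), Chunk("b", "..."), Chunk("c", "...")]
assert [c.unique_id for c in censor_preserving_order(chunks, {"c", "a"})] == ["a", "c"]
```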
@@ -51,9 +51,9 @@ def _get_objects_access_for_user_email_from_salesforce(

    # This is cached in the function so the first query takes an extra 0.1-0.3 seconds
    # but subsequent queries by the same user are essentially instant
    start_time = time.time()
    start_time = time.monotonic()
    user_id = get_salesforce_user_id_from_email(salesforce_client, user_email)
    end_time = time.time()
    end_time = time.monotonic()
    logger.info(
        f"Time taken to get Salesforce user ID: {end_time - start_time} seconds"
    )
@@ -1,10 +1,6 @@
|
||||
from simple_salesforce import Salesforce
|
||||
from sqlalchemy.orm import Session
|
||||
|
||||
from onyx.connectors.salesforce.sqlite_functions import get_user_id_by_email
|
||||
from onyx.connectors.salesforce.sqlite_functions import init_db
|
||||
from onyx.connectors.salesforce.sqlite_functions import NULL_ID_STRING
|
||||
from onyx.connectors.salesforce.sqlite_functions import update_email_to_id_table
|
||||
from onyx.db.connector_credential_pair import get_connector_credential_pair_from_id
|
||||
from onyx.db.document import get_cc_pairs_for_document
|
||||
from onyx.utils.logger import setup_logger
|
||||
@@ -28,6 +24,8 @@ def get_any_salesforce_client_for_doc_id(
|
||||
E.g. there are 2 different credential sets for 2 different salesforce cc_pairs
|
||||
but only one has the permissions to access the permissions needed for the query.
|
||||
"""
|
||||
|
||||
# NOTE: this global seems very very bad
|
||||
global _ANY_SALESFORCE_CLIENT
|
||||
if _ANY_SALESFORCE_CLIENT is None:
|
||||
cc_pairs = get_cc_pairs_for_document(db_session, doc_id)
|
||||
@@ -42,11 +40,18 @@ def get_any_salesforce_client_for_doc_id(
|
||||
|
||||
|
||||
def _query_salesforce_user_id(sf_client: Salesforce, user_email: str) -> str | None:
|
||||
query = f"SELECT Id FROM User WHERE Email = '{user_email}'"
|
||||
query = f"SELECT Id FROM User WHERE Username = '{user_email}' AND IsActive = true"
|
||||
result = sf_client.query(query)
|
||||
if len(result["records"]) == 0:
|
||||
return None
|
||||
return result["records"][0]["Id"]
|
||||
if len(result["records"]) > 0:
|
||||
return result["records"][0]["Id"]
|
||||
|
||||
# try emails
|
||||
query = f"SELECT Id FROM User WHERE Email = '{user_email}' AND IsActive = true"
|
||||
result = sf_client.query(query)
|
||||
if len(result["records"]) > 0:
|
||||
return result["records"][0]["Id"]
|
||||
|
||||
return None
|
||||
|
||||
|
||||
# This contains only the user_ids that we have found in Salesforce.
|
||||
@@ -77,35 +82,21 @@ def get_salesforce_user_id_from_email(
|
||||
salesforce database. (Around 0.1-0.3 seconds)
|
||||
If it's cached or stored in the local salesforce database, it's fast (<0.001 seconds).
|
||||
"""
|
||||
|
||||
# NOTE: this global seems bad
|
||||
global _CACHED_SF_EMAIL_TO_ID_MAP
|
||||
if user_email in _CACHED_SF_EMAIL_TO_ID_MAP:
|
||||
if _CACHED_SF_EMAIL_TO_ID_MAP[user_email] is not None:
|
||||
return _CACHED_SF_EMAIL_TO_ID_MAP[user_email]
|
||||
|
||||
db_exists = True
|
||||
try:
|
||||
# Check if the user is already in the database
|
||||
user_id = get_user_id_by_email(user_email)
|
||||
except Exception:
|
||||
init_db()
|
||||
try:
|
||||
user_id = get_user_id_by_email(user_email)
|
||||
except Exception as e:
|
||||
logger.error(f"Error checking if user is in database: {e}")
|
||||
user_id = None
|
||||
db_exists = False
|
||||
# some caching via sqlite existed here before ... check history if interested
|
||||
|
||||
# ...query Salesforce and store the result in the database
|
||||
user_id = _query_salesforce_user_id(sf_client, user_email)
|
||||
|
||||
# If no entry is found in the database (indicated by user_id being None)...
|
||||
if user_id is None:
|
||||
# ...query Salesforce and store the result in the database
|
||||
user_id = _query_salesforce_user_id(sf_client, user_email)
|
||||
if db_exists:
|
||||
update_email_to_id_table(user_email, user_id)
|
||||
return user_id
|
||||
elif user_id is None:
|
||||
return None
|
||||
elif user_id == NULL_ID_STRING:
|
||||
return None
|
||||
|
||||
# If the found user_id is real, cache it
|
||||
_CACHED_SF_EMAIL_TO_ID_MAP[user_email] = user_id
|
||||
return user_id
|
||||
|
||||
@@ -5,12 +5,14 @@ from slack_sdk import WebClient
|
||||
from ee.onyx.external_permissions.slack.utils import fetch_user_id_to_email_map
|
||||
from onyx.access.models import DocExternalAccess
|
||||
from onyx.access.models import ExternalAccess
|
||||
from onyx.connectors.credentials_provider import OnyxDBCredentialsProvider
|
||||
from onyx.connectors.slack.connector import get_channels
|
||||
from onyx.connectors.slack.connector import make_paginated_slack_api_call_w_retries
|
||||
from onyx.connectors.slack.connector import SlackConnector
|
||||
from onyx.db.models import ConnectorCredentialPair
|
||||
from onyx.indexing.indexing_heartbeat import IndexingHeartbeatInterface
|
||||
from onyx.utils.logger import setup_logger
|
||||
from shared_configs.contextvars import get_current_tenant_id
|
||||
|
||||
|
||||
logger = setup_logger()
|
||||
@@ -101,7 +103,12 @@ def _get_slack_document_access(
|
||||
callback: IndexingHeartbeatInterface | None,
|
||||
) -> Generator[DocExternalAccess, None, None]:
|
||||
slack_connector = SlackConnector(**cc_pair.connector.connector_specific_config)
|
||||
slack_connector.load_credentials(cc_pair.credential.credential_json)
|
||||
|
||||
# Use credentials provider instead of directly loading credentials
|
||||
provider = OnyxDBCredentialsProvider(
|
||||
get_current_tenant_id(), "slack", cc_pair.credential.id
|
||||
)
|
||||
slack_connector.set_credentials_provider(provider)
|
||||
|
||||
slim_doc_generator = slack_connector.retrieve_all_slim_documents(callback=callback)
|
||||
|
||||
|
||||
@@ -51,6 +51,7 @@ def _get_slack_group_members_email(
|
||||
|
||||
|
||||
def slack_group_sync(
|
||||
tenant_id: str,
|
||||
cc_pair: ConnectorCredentialPair,
|
||||
) -> list[ExternalUserGroup]:
|
||||
slack_client = WebClient(
|
||||
|
||||
@@ -15,6 +15,7 @@ from ee.onyx.external_permissions.post_query_censoring import (
|
||||
DOC_SOURCE_TO_CHUNK_CENSORING_FUNCTION,
|
||||
)
|
||||
from ee.onyx.external_permissions.slack.doc_sync import slack_doc_sync
|
||||
from ee.onyx.external_permissions.slack.group_sync import slack_group_sync
|
||||
from onyx.access.models import DocExternalAccess
|
||||
from onyx.configs.constants import DocumentSource
|
||||
from onyx.db.models import ConnectorCredentialPair
|
||||
@@ -56,6 +57,7 @@ DOC_PERMISSIONS_FUNC_MAP: dict[DocumentSource, DocSyncFuncType] = {
|
||||
GROUP_PERMISSIONS_FUNC_MAP: dict[DocumentSource, GroupSyncFuncType] = {
|
||||
DocumentSource.GOOGLE_DRIVE: gdrive_group_sync,
|
||||
DocumentSource.CONFLUENCE: confluence_group_sync,
|
||||
DocumentSource.SLACK: slack_group_sync,
|
||||
}
|
||||
|
||||
|
||||
|
||||
@@ -1,3 +1,4 @@
|
||||
import logging
|
||||
import os
|
||||
import shutil
|
||||
from collections.abc import AsyncGenerator
|
||||
@@ -8,6 +9,7 @@ import sentry_sdk
|
||||
import torch
|
||||
import uvicorn
|
||||
from fastapi import FastAPI
|
||||
from prometheus_fastapi_instrumentator import Instrumentator
|
||||
from sentry_sdk.integrations.fastapi import FastApiIntegration
|
||||
from sentry_sdk.integrations.starlette import StarletteIntegration
|
||||
from transformers import logging as transformer_logging # type:ignore
|
||||
@@ -20,6 +22,8 @@ from model_server.management_endpoints import router as management_router
|
||||
from model_server.utils import get_gpu_type
|
||||
from onyx import __version__
|
||||
from onyx.utils.logger import setup_logger
|
||||
from onyx.utils.logger import setup_uvicorn_logger
|
||||
from onyx.utils.middleware import add_onyx_request_id_middleware
|
||||
from shared_configs.configs import INDEXING_ONLY
|
||||
from shared_configs.configs import MIN_THREADS_ML_MODELS
|
||||
from shared_configs.configs import MODEL_SERVER_ALLOWED_HOST
|
||||
@@ -36,6 +40,12 @@ transformer_logging.set_verbosity_error()
|
||||
|
||||
logger = setup_logger()
|
||||
|
||||
file_handlers = [
|
||||
h for h in logger.logger.handlers if isinstance(h, logging.FileHandler)
|
||||
]
|
||||
|
||||
setup_uvicorn_logger(shared_file_handlers=file_handlers)
|
||||
|
||||
|
||||
def _move_files_recursively(source: Path, dest: Path, overwrite: bool = False) -> None:
|
||||
"""
|
||||
@@ -112,6 +122,15 @@ def get_model_app() -> FastAPI:
|
||||
application.include_router(encoders_router)
|
||||
application.include_router(custom_models_router)
|
||||
|
||||
request_id_prefix = "INF"
|
||||
if INDEXING_ONLY:
|
||||
request_id_prefix = "IDX"
|
||||
|
||||
add_onyx_request_id_middleware(application, request_id_prefix, logger)
|
||||
|
||||
# Initialize and instrument the app
|
||||
Instrumentator().instrument(application).expose(application)
|
||||
|
||||
return application
|
||||
|
||||
|
||||
|
||||
@@ -15,6 +15,22 @@ class ExternalAccess:
|
||||
# Whether the document is public in the external system or Onyx
|
||||
is_public: bool
|
||||
|
||||
def __str__(self) -> str:
|
||||
"""Prevent extremely long logs"""
|
||||
|
||||
def truncate_set(s: set[str], max_len: int = 100) -> str:
|
||||
s_str = str(s)
|
||||
if len(s_str) > max_len:
|
||||
return f"{s_str[:max_len]}... ({len(s)} items)"
|
||||
return s_str
|
||||
|
||||
return (
|
||||
f"ExternalAccess("
|
||||
f"external_user_emails={truncate_set(self.external_user_emails)}, "
|
||||
f"external_user_group_ids={truncate_set(self.external_user_group_ids)}, "
|
||||
f"is_public={self.is_public})"
|
||||
)
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class DocExternalAccess:
|
||||
|
||||
62
backend/onyx/agents/agent_search/dc_search_analysis/edges.py
Normal file
62
backend/onyx/agents/agent_search/dc_search_analysis/edges.py
Normal file
@@ -0,0 +1,62 @@
|
||||
from collections.abc import Hashable
|
||||
from typing import cast
|
||||
|
||||
from langchain_core.runnables.config import RunnableConfig
|
||||
from langgraph.types import Send
|
||||
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import ObjectInformationInput
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import (
|
||||
ObjectResearchInformationUpdate,
|
||||
)
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import ObjectSourceInput
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import (
|
||||
SearchSourcesObjectsUpdate,
|
||||
)
|
||||
from onyx.agents.agent_search.models import GraphConfig
|
||||
|
||||
|
||||
def parallel_object_source_research_edge(
|
||||
state: SearchSourcesObjectsUpdate, config: RunnableConfig
|
||||
) -> list[Send | Hashable]:
|
||||
"""
|
||||
LangGraph edge to parallelize the research for an individual object and source
|
||||
"""
|
||||
|
||||
search_objects = state.analysis_objects
|
||||
search_sources = state.analysis_sources
|
||||
|
||||
object_source_combinations = [
|
||||
(object, source) for object in search_objects for source in search_sources
|
||||
]
|
||||
|
||||
return [
|
||||
Send(
|
||||
"research_object_source",
|
||||
ObjectSourceInput(
|
||||
object_source_combination=object_source_combination,
|
||||
log_messages=[],
|
||||
),
|
||||
)
|
||||
for object_source_combination in object_source_combinations
|
||||
]
|
||||
|
||||
|
||||
def parallel_object_research_consolidation_edge(
|
||||
state: ObjectResearchInformationUpdate, config: RunnableConfig
|
||||
) -> list[Send | Hashable]:
|
||||
"""
|
||||
LangGraph edge to parallelize the research for an individual object and source
|
||||
"""
|
||||
cast(GraphConfig, config["metadata"]["config"])
|
||||
object_research_information_results = state.object_research_information_results
|
||||
|
||||
return [
|
||||
Send(
|
||||
"consolidate_object_research",
|
||||
ObjectInformationInput(
|
||||
object_information=object_information,
|
||||
log_messages=[],
|
||||
),
|
||||
)
|
||||
for object_information in object_research_information_results
|
||||
]
|
||||
@@ -0,0 +1,103 @@
|
||||
from langgraph.graph import END
|
||||
from langgraph.graph import START
|
||||
from langgraph.graph import StateGraph
|
||||
|
||||
from onyx.agents.agent_search.dc_search_analysis.edges import (
|
||||
parallel_object_research_consolidation_edge,
|
||||
)
|
||||
from onyx.agents.agent_search.dc_search_analysis.edges import (
|
||||
parallel_object_source_research_edge,
|
||||
)
|
||||
from onyx.agents.agent_search.dc_search_analysis.nodes.a1_search_objects import (
|
||||
search_objects,
|
||||
)
|
||||
from onyx.agents.agent_search.dc_search_analysis.nodes.a2_research_object_source import (
|
||||
research_object_source,
|
||||
)
|
||||
from onyx.agents.agent_search.dc_search_analysis.nodes.a3_structure_research_by_object import (
|
||||
structure_research_by_object,
|
||||
)
|
||||
from onyx.agents.agent_search.dc_search_analysis.nodes.a4_consolidate_object_research import (
|
||||
consolidate_object_research,
|
||||
)
|
||||
from onyx.agents.agent_search.dc_search_analysis.nodes.a5_consolidate_research import (
|
||||
consolidate_research,
|
||||
)
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import MainInput
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import MainState
|
||||
from onyx.utils.logger import setup_logger
|
||||
|
||||
logger = setup_logger()
|
||||
|
||||
test_mode = False
|
||||
|
||||
|
||||
def divide_and_conquer_graph_builder(test_mode: bool = False) -> StateGraph:
|
||||
"""
|
||||
LangGraph graph builder for the knowledge graph search process.
|
||||
"""
|
||||
|
||||
graph = StateGraph(
|
||||
state_schema=MainState,
|
||||
input=MainInput,
|
||||
)
|
||||
|
||||
### Add nodes ###
|
||||
|
||||
graph.add_node(
|
||||
"search_objects",
|
||||
search_objects,
|
||||
)
|
||||
|
||||
graph.add_node(
|
||||
"structure_research_by_source",
|
||||
structure_research_by_object,
|
||||
)
|
||||
|
||||
graph.add_node(
|
||||
"research_object_source",
|
||||
research_object_source,
|
||||
)
|
||||
|
||||
graph.add_node(
|
||||
"consolidate_object_research",
|
||||
consolidate_object_research,
|
||||
)
|
||||
|
||||
graph.add_node(
|
||||
"consolidate_research",
|
||||
consolidate_research,
|
||||
)
|
||||
|
||||
### Add edges ###
|
||||
|
||||
graph.add_edge(start_key=START, end_key="search_objects")
|
||||
|
||||
graph.add_conditional_edges(
|
||||
source="search_objects",
|
||||
path=parallel_object_source_research_edge,
|
||||
path_map=["research_object_source"],
|
||||
)
|
||||
|
||||
graph.add_edge(
|
||||
start_key="research_object_source",
|
||||
end_key="structure_research_by_source",
|
||||
)
|
||||
|
||||
graph.add_conditional_edges(
|
||||
source="structure_research_by_source",
|
||||
path=parallel_object_research_consolidation_edge,
|
||||
path_map=["consolidate_object_research"],
|
||||
)
|
||||
|
||||
graph.add_edge(
|
||||
start_key="consolidate_object_research",
|
||||
end_key="consolidate_research",
|
||||
)
|
||||
|
||||
graph.add_edge(
|
||||
start_key="consolidate_research",
|
||||
end_key=END,
|
||||
)
|
||||
|
||||
return graph
|
||||
@@ -0,0 +1,159 @@
|
||||
from typing import cast
|
||||
|
||||
from langchain_core.messages import HumanMessage
|
||||
from langchain_core.runnables import RunnableConfig
|
||||
from langgraph.types import StreamWriter
|
||||
|
||||
from onyx.agents.agent_search.dc_search_analysis.ops import extract_section
|
||||
from onyx.agents.agent_search.dc_search_analysis.ops import research
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import MainState
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import (
|
||||
SearchSourcesObjectsUpdate,
|
||||
)
|
||||
from onyx.agents.agent_search.models import GraphConfig
|
||||
from onyx.agents.agent_search.shared_graph_utils.agent_prompt_ops import (
|
||||
trim_prompt_piece,
|
||||
)
|
||||
from onyx.agents.agent_search.shared_graph_utils.utils import write_custom_event
|
||||
from onyx.chat.models import AgentAnswerPiece
|
||||
from onyx.configs.constants import DocumentSource
|
||||
from onyx.prompts.agents.dc_prompts import DC_OBJECT_NO_BASE_DATA_EXTRACTION_PROMPT
|
||||
from onyx.prompts.agents.dc_prompts import DC_OBJECT_SEPARATOR
|
||||
from onyx.prompts.agents.dc_prompts import DC_OBJECT_WITH_BASE_DATA_EXTRACTION_PROMPT
|
||||
from onyx.utils.logger import setup_logger
|
||||
from onyx.utils.threadpool_concurrency import run_with_timeout
|
||||
|
||||
logger = setup_logger()
|
||||
|
||||
|
||||
def search_objects(
|
||||
state: MainState, config: RunnableConfig, writer: StreamWriter = lambda _: None
|
||||
) -> SearchSourcesObjectsUpdate:
|
||||
"""
|
||||
LangGraph node to start the agentic search process.
|
||||
"""
|
||||
|
||||
graph_config = cast(GraphConfig, config["metadata"]["config"])
|
||||
question = graph_config.inputs.search_request.query
|
||||
search_tool = graph_config.tooling.search_tool
|
||||
|
||||
if search_tool is None or graph_config.inputs.search_request.persona is None:
|
||||
raise ValueError("Search tool and persona must be provided for DivCon search")
|
||||
|
||||
try:
|
||||
instructions = graph_config.inputs.search_request.persona.prompts[
|
||||
0
|
||||
].system_prompt
|
||||
|
||||
agent_1_instructions = extract_section(
|
||||
instructions, "Agent Step 1:", "Agent Step 2:"
|
||||
)
|
||||
if agent_1_instructions is None:
|
||||
raise ValueError("Agent 1 instructions not found")
|
||||
|
||||
agent_1_base_data = extract_section(instructions, "|Start Data|", "|End Data|")
|
||||
|
||||
agent_1_task = extract_section(
|
||||
agent_1_instructions, "Task:", "Independent Research Sources:"
|
||||
)
|
||||
if agent_1_task is None:
|
||||
raise ValueError("Agent 1 task not found")
|
||||
|
||||
agent_1_independent_sources_str = extract_section(
|
||||
agent_1_instructions, "Independent Research Sources:", "Output Objective:"
|
||||
)
|
||||
if agent_1_independent_sources_str is None:
|
||||
raise ValueError("Agent 1 Independent Research Sources not found")
|
||||
|
||||
document_sources = [
|
||||
DocumentSource(x.strip().lower())
|
||||
for x in agent_1_independent_sources_str.split(DC_OBJECT_SEPARATOR)
|
||||
]
|
||||
|
||||
agent_1_output_objective = extract_section(
|
||||
agent_1_instructions, "Output Objective:"
|
||||
)
|
||||
if agent_1_output_objective is None:
|
||||
raise ValueError("Agent 1 output objective not found")
|
||||
|
||||
except Exception as e:
|
||||
raise ValueError(
|
||||
f"Agent 1 instructions not found or not formatted correctly: {e}"
|
||||
)
|
||||
|
||||
# Extract objects
|
||||
|
||||
if agent_1_base_data is None:
|
||||
# Retrieve chunks for objects
|
||||
|
||||
retrieved_docs = research(question, search_tool)[:10]
|
||||
|
||||
document_texts_list = []
|
||||
for doc_num, doc in enumerate(retrieved_docs):
|
||||
chunk_text = "Document " + str(doc_num) + ":\n" + doc.content
|
||||
document_texts_list.append(chunk_text)
|
||||
|
||||
document_texts = "\n\n".join(document_texts_list)
|
||||
|
||||
dc_object_extraction_prompt = DC_OBJECT_NO_BASE_DATA_EXTRACTION_PROMPT.format(
|
||||
question=question,
|
||||
task=agent_1_task,
|
||||
document_text=document_texts,
|
||||
objects_of_interest=agent_1_output_objective,
|
||||
)
|
||||
else:
|
||||
dc_object_extraction_prompt = DC_OBJECT_WITH_BASE_DATA_EXTRACTION_PROMPT.format(
|
||||
question=question,
|
||||
task=agent_1_task,
|
||||
base_data=agent_1_base_data,
|
||||
objects_of_interest=agent_1_output_objective,
|
||||
)
|
||||
|
||||
msg = [
|
||||
HumanMessage(
|
||||
content=trim_prompt_piece(
|
||||
config=graph_config.tooling.primary_llm.config,
|
||||
prompt_piece=dc_object_extraction_prompt,
|
||||
reserved_str="",
|
||||
),
|
||||
)
|
||||
]
|
||||
primary_llm = graph_config.tooling.primary_llm
|
||||
# Grader
|
||||
try:
|
||||
llm_response = run_with_timeout(
|
||||
30,
|
||||
primary_llm.invoke,
|
||||
prompt=msg,
|
||||
timeout_override=30,
|
||||
max_tokens=300,
|
||||
)
|
||||
|
||||
cleaned_response = (
|
||||
str(llm_response.content)
|
||||
.replace("```json\n", "")
|
||||
.replace("\n```", "")
|
||||
.replace("\n", "")
|
||||
)
|
||||
cleaned_response = cleaned_response.split("OBJECTS:")[1]
|
||||
object_list = [x.strip() for x in cleaned_response.split(";")]
|
||||
|
||||
except Exception as e:
|
||||
raise ValueError(f"Error in search_objects: {e}")
|
||||
|
||||
write_custom_event(
|
||||
"initial_agent_answer",
|
||||
AgentAnswerPiece(
|
||||
answer_piece=" Researching the individual objects for each source type... ",
|
||||
level=0,
|
||||
level_question_num=0,
|
||||
answer_type="agent_level_answer",
|
||||
),
|
||||
writer,
|
||||
)
|
||||
|
||||
return SearchSourcesObjectsUpdate(
|
||||
analysis_objects=object_list,
|
||||
analysis_sources=document_sources,
|
||||
log_messages=["Agent 1 Task done"],
|
||||
)
|
||||
@@ -0,0 +1,185 @@
|
||||
from datetime import datetime
|
||||
from datetime import timedelta
|
||||
from datetime import timezone
|
||||
from typing import cast
|
||||
|
||||
from langchain_core.messages import HumanMessage
|
||||
from langchain_core.runnables import RunnableConfig
|
||||
from langgraph.types import StreamWriter
|
||||
|
||||
from onyx.agents.agent_search.dc_search_analysis.ops import extract_section
|
||||
from onyx.agents.agent_search.dc_search_analysis.ops import research
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import ObjectSourceInput
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import (
|
||||
ObjectSourceResearchUpdate,
|
||||
)
|
||||
from onyx.agents.agent_search.models import GraphConfig
|
||||
from onyx.agents.agent_search.shared_graph_utils.agent_prompt_ops import (
|
||||
trim_prompt_piece,
|
||||
)
|
||||
from onyx.prompts.agents.dc_prompts import DC_OBJECT_SOURCE_RESEARCH_PROMPT
|
||||
from onyx.utils.logger import setup_logger
|
||||
from onyx.utils.threadpool_concurrency import run_with_timeout
|
||||
|
||||
logger = setup_logger()
|
||||
|
||||
|
||||
def research_object_source(
|
||||
state: ObjectSourceInput,
|
||||
config: RunnableConfig,
|
||||
writer: StreamWriter = lambda _: None,
|
||||
) -> ObjectSourceResearchUpdate:
|
||||
"""
|
||||
LangGraph node to start the agentic search process.
|
||||
"""
|
||||
datetime.now()
|
||||
|
||||
graph_config = cast(GraphConfig, config["metadata"]["config"])
|
||||
graph_config.inputs.search_request.query
|
||||
search_tool = graph_config.tooling.search_tool
|
||||
question = graph_config.inputs.search_request.query
|
||||
object, document_source = state.object_source_combination
|
||||
|
||||
if search_tool is None or graph_config.inputs.search_request.persona is None:
|
||||
raise ValueError("Search tool and persona must be provided for DivCon search")
|
||||
|
||||
try:
|
||||
instructions = graph_config.inputs.search_request.persona.prompts[
|
||||
0
|
||||
].system_prompt
|
||||
|
||||
agent_2_instructions = extract_section(
|
||||
instructions, "Agent Step 2:", "Agent Step 3:"
|
||||
)
|
||||
if agent_2_instructions is None:
|
||||
raise ValueError("Agent 2 instructions not found")
|
||||
|
||||
agent_2_task = extract_section(
|
||||
agent_2_instructions, "Task:", "Independent Research Sources:"
|
||||
)
|
||||
if agent_2_task is None:
|
||||
raise ValueError("Agent 2 task not found")
|
||||
|
||||
agent_2_time_cutoff = extract_section(
|
||||
agent_2_instructions, "Time Cutoff:", "Research Topics:"
|
||||
)
|
||||
|
||||
agent_2_research_topics = extract_section(
|
||||
agent_2_instructions, "Research Topics:", "Output Objective"
|
||||
)
|
||||
|
||||
agent_2_output_objective = extract_section(
|
||||
agent_2_instructions, "Output Objective:"
|
||||
)
|
||||
if agent_2_output_objective is None:
|
||||
raise ValueError("Agent 2 output objective not found")
|
||||
|
||||
except Exception:
|
||||
raise ValueError(
|
||||
"Agent 1 instructions not found or not formatted correctly: {e}"
|
||||
)
|
||||
|
||||
# Populate prompt
|
||||
|
||||
# Retrieve chunks for objects
|
||||
|
||||
if agent_2_time_cutoff is not None and agent_2_time_cutoff.strip() != "":
|
||||
if agent_2_time_cutoff.strip().endswith("d"):
|
||||
try:
|
||||
days = int(agent_2_time_cutoff.strip()[:-1])
|
||||
agent_2_source_start_time = datetime.now(timezone.utc) - timedelta(
|
||||
days=days
|
||||
)
|
||||
except ValueError:
|
||||
raise ValueError(
|
||||
f"Invalid time cutoff format: {agent_2_time_cutoff}. Expected format: '<number>d'"
|
||||
)
|
||||
else:
|
||||
raise ValueError(
|
||||
f"Invalid time cutoff format: {agent_2_time_cutoff}. Expected format: '<number>d'"
|
||||
)
|
||||
else:
|
||||
agent_2_source_start_time = None
|
||||
|
||||
document_sources = [document_source] if document_source else None
|
||||
|
||||
if len(question.strip()) > 0:
|
||||
research_area = f"{question} for {object}"
|
||||
elif agent_2_research_topics and len(agent_2_research_topics.strip()) > 0:
|
||||
research_area = f"{agent_2_research_topics} for {object}"
|
||||
else:
|
||||
research_area = object
|
||||
|
||||
retrieved_docs = research(
|
||||
question=research_area,
|
||||
search_tool=search_tool,
|
||||
document_sources=document_sources,
|
||||
time_cutoff=agent_2_source_start_time,
|
||||
)
|
||||
|
||||
# Generate document text
|
||||
|
||||
document_texts_list = []
|
||||
for doc_num, doc in enumerate(retrieved_docs):
|
||||
chunk_text = "Document " + str(doc_num) + ":\n" + doc.content
|
||||
document_texts_list.append(chunk_text)
|
||||
|
||||
document_texts = "\n\n".join(document_texts_list)
|
||||
|
||||
# Built prompt
|
||||
|
||||
today = datetime.now().strftime("%A, %Y-%m-%d")
|
||||
|
||||
dc_object_source_research_prompt = (
|
||||
DC_OBJECT_SOURCE_RESEARCH_PROMPT.format(
|
||||
today=today,
|
||||
question=question,
|
||||
task=agent_2_task,
|
||||
document_text=document_texts,
|
||||
format=agent_2_output_objective,
|
||||
)
|
||||
.replace("---object---", object)
|
||||
.replace("---source---", document_source.value)
|
||||
)
|
||||
|
||||
# Run LLM
|
||||
|
||||
msg = [
|
||||
HumanMessage(
|
||||
content=trim_prompt_piece(
|
||||
config=graph_config.tooling.primary_llm.config,
|
||||
prompt_piece=dc_object_source_research_prompt,
|
||||
reserved_str="",
|
||||
),
|
||||
)
|
||||
]
|
||||
# fast_llm = graph_config.tooling.fast_llm
|
||||
primary_llm = graph_config.tooling.primary_llm
|
||||
llm = primary_llm
|
||||
# Grader
|
||||
try:
|
||||
llm_response = run_with_timeout(
|
||||
30,
|
||||
llm.invoke,
|
||||
prompt=msg,
|
||||
timeout_override=30,
|
||||
max_tokens=300,
|
||||
)
|
||||
|
||||
cleaned_response = str(llm_response.content).replace("```json\n", "")
|
||||
cleaned_response = cleaned_response.split("RESEARCH RESULTS:")[1]
|
||||
object_research_results = {
|
||||
"object": object,
|
||||
"source": document_source.value,
|
||||
"research_result": cleaned_response,
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
raise ValueError(f"Error in research_object_source: {e}")
|
||||
|
||||
logger.debug("DivCon Step A2 - Object Source Research - completed for an object")
|
||||
|
||||
return ObjectSourceResearchUpdate(
|
||||
object_source_research_results=[object_research_results],
|
||||
log_messages=["Agent Step 2 done for one object"],
|
||||
)
|
||||
@@ -0,0 +1,68 @@
|
||||
from collections import defaultdict
|
||||
from datetime import datetime
|
||||
from typing import cast
|
||||
from typing import Dict
|
||||
from typing import List
|
||||
|
||||
from langchain_core.runnables import RunnableConfig
|
||||
from langgraph.types import StreamWriter
|
||||
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import MainState
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import (
|
||||
ObjectResearchInformationUpdate,
|
||||
)
|
||||
from onyx.agents.agent_search.models import GraphConfig
|
||||
from onyx.agents.agent_search.shared_graph_utils.utils import write_custom_event
|
||||
from onyx.chat.models import AgentAnswerPiece
|
||||
from onyx.utils.logger import setup_logger
|
||||
|
||||
logger = setup_logger()
|
||||
|
||||
|
||||
def structure_research_by_object(
|
||||
state: MainState, config: RunnableConfig, writer: StreamWriter = lambda _: None
|
||||
) -> ObjectResearchInformationUpdate:
|
||||
"""
|
||||
LangGraph node to start the agentic search process.
|
||||
"""
|
||||
datetime.now()
|
||||
|
||||
graph_config = cast(GraphConfig, config["metadata"]["config"])
|
||||
graph_config.inputs.search_request.query
|
||||
|
||||
write_custom_event(
|
||||
"initial_agent_answer",
|
||||
AgentAnswerPiece(
|
||||
answer_piece=" consolidating the information across source types for each object...",
|
||||
level=0,
|
||||
level_question_num=0,
|
||||
answer_type="agent_level_answer",
|
||||
),
|
||||
writer,
|
||||
)
|
||||
|
||||
object_source_research_results = state.object_source_research_results
|
||||
|
||||
object_research_information_results: List[Dict[str, str]] = []
|
||||
object_research_information_results_list: Dict[str, List[str]] = defaultdict(list)
|
||||
|
||||
for object_source_research in object_source_research_results:
|
||||
object = object_source_research["object"]
|
||||
source = object_source_research["source"]
|
||||
research_result = object_source_research["research_result"]
|
||||
|
||||
object_research_information_results_list[object].append(
|
||||
f"Source: {source}\n{research_result}"
|
||||
)
|
||||
|
||||
for object, information in object_research_information_results_list.items():
|
||||
object_research_information_results.append(
|
||||
{"object": object, "information": "\n".join(information)}
|
||||
)
|
||||
|
||||
logger.debug("DivCon Step A3 - Object Research Information Structuring - completed")
|
||||
|
||||
return ObjectResearchInformationUpdate(
|
||||
object_research_information_results=object_research_information_results,
|
||||
log_messages=["A3 - Object Research Information structured"],
|
||||
)
|
||||
@@ -0,0 +1,107 @@
|
||||
from typing import cast
|
||||
|
||||
from langchain_core.messages import HumanMessage
|
||||
from langchain_core.runnables import RunnableConfig
|
||||
from langgraph.types import StreamWriter
|
||||
|
||||
from onyx.agents.agent_search.dc_search_analysis.ops import extract_section
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import ObjectInformationInput
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import ObjectResearchUpdate
|
||||
from onyx.agents.agent_search.models import GraphConfig
|
||||
from onyx.agents.agent_search.shared_graph_utils.agent_prompt_ops import (
|
||||
trim_prompt_piece,
|
||||
)
|
||||
from onyx.prompts.agents.dc_prompts import DC_OBJECT_CONSOLIDATION_PROMPT
|
||||
from onyx.utils.logger import setup_logger
|
||||
from onyx.utils.threadpool_concurrency import run_with_timeout
|
||||
|
||||
logger = setup_logger()
|
||||
|
||||
|
||||
def consolidate_object_research(
|
||||
state: ObjectInformationInput,
|
||||
config: RunnableConfig,
|
||||
writer: StreamWriter = lambda _: None,
|
||||
) -> ObjectResearchUpdate:
|
||||
"""
|
||||
LangGraph node to start the agentic search process.
|
||||
"""
|
||||
graph_config = cast(GraphConfig, config["metadata"]["config"])
|
||||
graph_config.inputs.search_request.query
|
||||
search_tool = graph_config.tooling.search_tool
|
||||
question = graph_config.inputs.search_request.query
|
||||
|
||||
if search_tool is None or graph_config.inputs.search_request.persona is None:
|
||||
raise ValueError("Search tool and persona must be provided for DivCon search")
|
||||
|
||||
instructions = graph_config.inputs.search_request.persona.prompts[0].system_prompt
|
||||
|
||||
agent_4_instructions = extract_section(
|
||||
instructions, "Agent Step 4:", "Agent Step 5:"
|
||||
)
|
||||
if agent_4_instructions is None:
|
||||
raise ValueError("Agent 4 instructions not found")
|
||||
agent_4_output_objective = extract_section(
|
||||
agent_4_instructions, "Output Objective:"
|
||||
)
|
||||
if agent_4_output_objective is None:
|
||||
raise ValueError("Agent 4 output objective not found")
|
||||
|
||||
object_information = state.object_information
|
||||
|
||||
object = object_information["object"]
|
||||
information = object_information["information"]
|
||||
|
||||
# Create a prompt for the object consolidation
|
||||
|
||||
dc_object_consolidation_prompt = DC_OBJECT_CONSOLIDATION_PROMPT.format(
|
||||
question=question,
|
||||
object=object,
|
||||
information=information,
|
||||
format=agent_4_output_objective,
|
||||
)
|
||||
|
||||
# Run LLM
|
||||
|
||||
msg = [
|
||||
HumanMessage(
|
||||
content=trim_prompt_piece(
|
||||
config=graph_config.tooling.primary_llm.config,
|
||||
prompt_piece=dc_object_consolidation_prompt,
|
||||
reserved_str="",
|
||||
),
|
||||
)
|
||||
]
|
||||
graph_config.tooling.primary_llm
|
||||
# fast_llm = graph_config.tooling.fast_llm
|
||||
primary_llm = graph_config.tooling.primary_llm
|
||||
llm = primary_llm
|
||||
# Grader
|
||||
try:
|
||||
llm_response = run_with_timeout(
|
||||
30,
|
||||
llm.invoke,
|
||||
prompt=msg,
|
||||
timeout_override=30,
|
||||
max_tokens=300,
|
||||
)
|
||||
|
||||
cleaned_response = str(llm_response.content).replace("```json\n", "")
|
||||
consolidated_information = cleaned_response.split("INFORMATION:")[1]
|
||||
|
||||
except Exception as e:
|
||||
raise ValueError(f"Error in consolidate_object_research: {e}")
|
||||
|
||||
object_research_results = {
|
||||
"object": object,
|
||||
"research_result": consolidated_information,
|
||||
}
|
||||
|
||||
logger.debug(
|
||||
"DivCon Step A4 - Object Research Consolidation - completed for an object"
|
||||
)
|
||||
|
||||
return ObjectResearchUpdate(
|
||||
object_research_results=[object_research_results],
|
||||
log_messages=["Agent Source Consilidation done"],
|
||||
)
|
||||
@@ -0,0 +1,164 @@
|
||||
from datetime import datetime
|
||||
from typing import cast
|
||||
|
||||
from langchain_core.messages import HumanMessage
|
||||
from langchain_core.runnables import RunnableConfig
|
||||
from langgraph.types import StreamWriter
|
||||
|
||||
from onyx.agents.agent_search.dc_search_analysis.ops import extract_section
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import MainState
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import ResearchUpdate
|
||||
from onyx.agents.agent_search.models import GraphConfig
|
||||
from onyx.agents.agent_search.shared_graph_utils.agent_prompt_ops import (
|
||||
trim_prompt_piece,
|
||||
)
|
||||
from onyx.agents.agent_search.shared_graph_utils.utils import write_custom_event
|
||||
from onyx.chat.models import AgentAnswerPiece
|
||||
from onyx.prompts.agents.dc_prompts import DC_FORMATTING_NO_BASE_DATA_PROMPT
|
||||
from onyx.prompts.agents.dc_prompts import DC_FORMATTING_WITH_BASE_DATA_PROMPT
|
||||
from onyx.utils.logger import setup_logger
|
||||
from onyx.utils.threadpool_concurrency import run_with_timeout
|
||||
|
||||
logger = setup_logger()
|
||||
|
||||
|
||||
def consolidate_research(
|
||||
state: MainState, config: RunnableConfig, writer: StreamWriter = lambda _: None
|
||||
) -> ResearchUpdate:
|
||||
"""
|
||||
LangGraph node to start the agentic search process.
|
||||
"""
|
||||
datetime.now()
|
||||
|
||||
graph_config = cast(GraphConfig, config["metadata"]["config"])
|
||||
graph_config.inputs.search_request.query
|
||||
|
||||
search_tool = graph_config.tooling.search_tool
|
||||
|
||||
write_custom_event(
|
||||
"initial_agent_answer",
|
||||
AgentAnswerPiece(
|
||||
answer_piece=" generating the answer\n\n\n",
|
||||
level=0,
|
||||
level_question_num=0,
|
||||
answer_type="agent_level_answer",
|
||||
),
|
||||
writer,
|
||||
)
|
||||
|
||||
if search_tool is None or graph_config.inputs.search_request.persona is None:
|
||||
raise ValueError("Search tool and persona must be provided for DivCon search")
|
||||
|
||||
# Populate prompt
|
||||
instructions = graph_config.inputs.search_request.persona.prompts[0].system_prompt
|
||||
|
||||
try:
|
||||
agent_5_instructions = extract_section(
|
||||
instructions, "Agent Step 5:", "Agent End"
|
||||
)
|
||||
if agent_5_instructions is None:
|
||||
raise ValueError("Agent 5 instructions not found")
|
||||
agent_5_base_data = extract_section(instructions, "|Start Data|", "|End Data|")
|
||||
agent_5_task = extract_section(
|
||||
agent_5_instructions, "Task:", "Independent Research Sources:"
|
||||
)
|
||||
if agent_5_task is None:
|
||||
raise ValueError("Agent 5 task not found")
|
||||
agent_5_output_objective = extract_section(
|
||||
agent_5_instructions, "Output Objective:"
|
||||
)
|
||||
if agent_5_output_objective is None:
|
||||
raise ValueError("Agent 5 output objective not found")
|
||||
except ValueError as e:
|
||||
raise ValueError(
|
||||
f"Instructions for Agent Step 5 were not properly formatted: {e}"
|
||||
)
|
||||
|
||||
research_result_list = []
|
||||
|
||||
if agent_5_task.strip() == "*concatenate*":
|
||||
object_research_results = state.object_research_results
|
||||
|
||||
for object_research_result in object_research_results:
|
||||
object = object_research_result["object"]
|
||||
research_result = object_research_result["research_result"]
|
||||
research_result_list.append(f"Object: {object}\n\n{research_result}")
|
||||
|
||||
research_results = "\n\n".join(research_result_list)
|
||||
|
||||
else:
|
||||
raise NotImplementedError("Only '*concatenate*' is currently supported")
|
||||
|
||||
# Create a prompt for the object consolidation
|
||||
|
||||
if agent_5_base_data is None:
|
||||
dc_formatting_prompt = DC_FORMATTING_NO_BASE_DATA_PROMPT.format(
|
||||
text=research_results,
|
||||
format=agent_5_output_objective,
|
||||
)
|
||||
else:
|
||||
dc_formatting_prompt = DC_FORMATTING_WITH_BASE_DATA_PROMPT.format(
|
||||
base_data=agent_5_base_data,
|
||||
text=research_results,
|
||||
format=agent_5_output_objective,
|
||||
)
|
||||
|
||||
# Run LLM
|
||||
|
||||
msg = [
|
||||
HumanMessage(
|
||||
content=trim_prompt_piece(
|
||||
config=graph_config.tooling.primary_llm.config,
|
||||
prompt_piece=dc_formatting_prompt,
|
||||
reserved_str="",
|
||||
),
|
||||
)
|
||||
]
|
||||
|
||||
dispatch_timings: list[float] = []
|
||||
|
||||
primary_model = graph_config.tooling.primary_llm
|
||||
|
||||
def stream_initial_answer() -> list[str]:
|
||||
response: list[str] = []
|
||||
for message in primary_model.stream(msg, timeout_override=30, max_tokens=None):
|
||||
# TODO: in principle, the answer here COULD contain images, but we don't support that yet
|
||||
content = message.content
|
||||
if not isinstance(content, str):
|
||||
raise ValueError(
|
||||
f"Expected content to be a string, but got {type(content)}"
|
||||
)
|
||||
start_stream_token = datetime.now()
|
||||
|
||||
write_custom_event(
|
||||
"initial_agent_answer",
|
||||
AgentAnswerPiece(
|
||||
answer_piece=content,
|
||||
level=0,
|
||||
level_question_num=0,
|
||||
answer_type="agent_level_answer",
|
||||
),
|
||||
writer,
|
||||
)
|
||||
end_stream_token = datetime.now()
|
||||
dispatch_timings.append(
|
||||
(end_stream_token - start_stream_token).microseconds
|
||||
)
|
||||
response.append(content)
|
||||
return response
|
||||
|
||||
try:
|
||||
_ = run_with_timeout(
|
||||
60,
|
||||
stream_initial_answer,
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
raise ValueError(f"Error in consolidate_research: {e}")
|
||||
|
||||
logger.debug("DivCon Step A5 - Final Generation - completed")
|
||||
|
||||
return ResearchUpdate(
|
||||
research_results=research_results,
|
||||
log_messages=["Agent Source Consilidation done"],
|
||||
)
|
||||
backend/onyx/agents/agent_search/dc_search_analysis/ops.py (new file, 61 lines)
@@ -0,0 +1,61 @@
from datetime import datetime
from typing import cast

from onyx.chat.models import LlmDoc
from onyx.configs.constants import DocumentSource
from onyx.context.search.models import InferenceSection
from onyx.db.engine import get_session_with_current_tenant
from onyx.tools.models import SearchToolOverrideKwargs
from onyx.tools.tool_implementations.search.search_tool import (
    FINAL_CONTEXT_DOCUMENTS_ID,
)
from onyx.tools.tool_implementations.search.search_tool import SearchTool


def research(
    question: str,
    search_tool: SearchTool,
    document_sources: list[DocumentSource] | None = None,
    time_cutoff: datetime | None = None,
) -> list[LlmDoc]:
    # new db session to avoid concurrency issues

    callback_container: list[list[InferenceSection]] = []
    retrieved_docs: list[LlmDoc] = []

    with get_session_with_current_tenant() as db_session:
        for tool_response in search_tool.run(
            query=question,
            override_kwargs=SearchToolOverrideKwargs(
                force_no_rerank=False,
                alternate_db_session=db_session,
                retrieved_sections_callback=callback_container.append,
                skip_query_analysis=True,
                document_sources=document_sources,
                time_cutoff=time_cutoff,
            ),
        ):
            # get retrieved docs to send to the rest of the graph
            if tool_response.id == FINAL_CONTEXT_DOCUMENTS_ID:
                retrieved_docs = cast(list[LlmDoc], tool_response.response)[:10]
                break
    return retrieved_docs


def extract_section(
    text: str, start_marker: str, end_marker: str | None = None
) -> str | None:
    """Extract text between markers, returning None if markers not found"""
    parts = text.split(start_marker)

    if len(parts) == 1:
        return None

    after_start = parts[1].strip()

    if not end_marker:
        return after_start

    extract = after_start.split(end_marker)[0]

    return extract.strip()
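A quick illustration of extract_section's contract (a minimal sketch; the marker strings and text below are made up for the example, not taken from the repo):

# Hypothetical markers, purely to show how extract_section splits on them.
text = "REASONING: the object is relevant ANSWER: yes EXTRA: ignored"
extract_section(text, "ANSWER:", "EXTRA:")  # -> "yes"
extract_section(text, "ANSWER:")            # -> "yes EXTRA: ignored" (no end marker: rest of the string)
extract_section(text, "MISSING:")           # -> None (start marker absent)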
@@ -0,0 +1,72 @@
from operator import add
from typing import Annotated
from typing import Dict
from typing import TypedDict

from pydantic import BaseModel

from onyx.agents.agent_search.core_state import CoreState
from onyx.agents.agent_search.orchestration.states import ToolCallUpdate
from onyx.agents.agent_search.orchestration.states import ToolChoiceInput
from onyx.agents.agent_search.orchestration.states import ToolChoiceUpdate
from onyx.configs.constants import DocumentSource


### States ###
class LoggerUpdate(BaseModel):
    log_messages: Annotated[list[str], add] = []


class SearchSourcesObjectsUpdate(LoggerUpdate):
    analysis_objects: list[str] = []
    analysis_sources: list[DocumentSource] = []


class ObjectSourceInput(LoggerUpdate):
    object_source_combination: tuple[str, DocumentSource]


class ObjectSourceResearchUpdate(LoggerUpdate):
    object_source_research_results: Annotated[list[Dict[str, str]], add] = []


class ObjectInformationInput(LoggerUpdate):
    object_information: Dict[str, str]


class ObjectResearchInformationUpdate(LoggerUpdate):
    object_research_information_results: Annotated[list[Dict[str, str]], add] = []


class ObjectResearchUpdate(LoggerUpdate):
    object_research_results: Annotated[list[Dict[str, str]], add] = []


class ResearchUpdate(LoggerUpdate):
    research_results: str | None = None


## Graph Input State
class MainInput(CoreState):
    pass


## Graph State
class MainState(
    # This includes the core state
    MainInput,
    ToolChoiceInput,
    ToolCallUpdate,
    ToolChoiceUpdate,
    SearchSourcesObjectsUpdate,
    ObjectSourceResearchUpdate,
    ObjectResearchInformationUpdate,
    ObjectResearchUpdate,
    ResearchUpdate,
):
    pass


## Graph Output State - presently not used
class MainOutput(TypedDict):
    log_messages: list[str]
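The Annotated[list[...], add] fields above act as reducers: when several graph nodes each return a partial state update, the lists are concatenated with operator.add instead of the later update overwriting the earlier one. A minimal sketch of that merge semantics (the log strings are illustrative only):

from operator import add

accumulated: list[str] = []
update_from_node_a = ["source separation done"]
update_from_node_b = ["source research done"]

accumulated = add(accumulated, update_from_node_a)  # ["source separation done"]
accumulated = add(accumulated, update_from_node_b)  # ["source separation done", "source research done"]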
@@ -8,6 +8,10 @@ from langgraph.graph.state import CompiledStateGraph
|
||||
|
||||
from onyx.agents.agent_search.basic.graph_builder import basic_graph_builder
|
||||
from onyx.agents.agent_search.basic.states import BasicInput
|
||||
from onyx.agents.agent_search.dc_search_analysis.graph_builder import (
|
||||
divide_and_conquer_graph_builder,
|
||||
)
|
||||
from onyx.agents.agent_search.dc_search_analysis.states import MainInput as DCMainInput
|
||||
from onyx.agents.agent_search.deep_search.main.graph_builder import (
|
||||
main_graph_builder as main_graph_builder_a,
|
||||
)
|
||||
@@ -82,7 +86,7 @@ def _parse_agent_event(
|
||||
def manage_sync_streaming(
|
||||
compiled_graph: CompiledStateGraph,
|
||||
config: GraphConfig,
|
||||
graph_input: BasicInput | MainInput,
|
||||
graph_input: BasicInput | MainInput | DCMainInput,
|
||||
) -> Iterable[StreamEvent]:
|
||||
message_id = config.persistence.message_id if config.persistence else None
|
||||
for event in compiled_graph.stream(
|
||||
@@ -96,7 +100,7 @@ def manage_sync_streaming(
|
||||
def run_graph(
|
||||
compiled_graph: CompiledStateGraph,
|
||||
config: GraphConfig,
|
||||
input: BasicInput | MainInput,
|
||||
input: BasicInput | MainInput | DCMainInput,
|
||||
) -> AnswerStream:
|
||||
config.behavior.perform_initial_search_decomposition = (
|
||||
INITIAL_SEARCH_DECOMPOSITION_ENABLED
|
||||
@@ -146,6 +150,16 @@ def run_basic_graph(
|
||||
return run_graph(compiled_graph, config, input)
|
||||
|
||||
|
||||
def run_dc_graph(
|
||||
config: GraphConfig,
|
||||
) -> AnswerStream:
|
||||
graph = divide_and_conquer_graph_builder()
|
||||
compiled_graph = graph.compile()
|
||||
input = DCMainInput(log_messages=[])
|
||||
config.inputs.search_request.query = config.inputs.search_request.query.strip()
|
||||
return run_graph(compiled_graph, config, input)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
for _ in range(1):
|
||||
query_start_time = datetime.now()
|
||||
|
||||
@@ -180,3 +180,35 @@ def binary_string_test_after_answer_separator(
|
||||
relevant_text = text.split(f"{separator}")[-1]
|
||||
|
||||
return binary_string_test(relevant_text, positive_value)
|
||||
|
||||
|
||||
def build_dc_search_prompt(
|
||||
question: str,
|
||||
original_question: str,
|
||||
docs: list[InferenceSection],
|
||||
persona_specification: str,
|
||||
config: LLMConfig,
|
||||
) -> list[SystemMessage | HumanMessage | AIMessage | ToolMessage]:
|
||||
system_message = SystemMessage(
|
||||
content=persona_specification,
|
||||
)
|
||||
|
||||
date_str = build_date_time_string()
|
||||
|
||||
docs_str = format_docs(docs)
|
||||
|
||||
docs_str = trim_prompt_piece(
|
||||
config,
|
||||
docs_str,
|
||||
SUB_QUESTION_RAG_PROMPT + question + original_question + date_str,
|
||||
)
|
||||
human_message = HumanMessage(
|
||||
content=SUB_QUESTION_RAG_PROMPT.format(
|
||||
question=question,
|
||||
original_question=original_question,
|
||||
context=docs_str,
|
||||
date_prompt=date_str,
|
||||
)
|
||||
)
|
||||
|
||||
return [system_message, human_message]
|
||||
|
||||
@@ -1,5 +1,6 @@
|
||||
import logging
|
||||
import multiprocessing
|
||||
import os
|
||||
import time
|
||||
from typing import Any
|
||||
from typing import cast
|
||||
@@ -305,7 +306,7 @@ def wait_for_db(sender: Any, **kwargs: Any) -> None:
|
||||
|
||||
|
||||
def on_secondary_worker_init(sender: Any, **kwargs: Any) -> None:
|
||||
logger.info("Running as a secondary celery worker.")
|
||||
logger.info(f"Running as a secondary celery worker: pid={os.getpid()}")
|
||||
|
||||
# Set up variables for waiting on primary worker
|
||||
WAIT_INTERVAL = 5
|
||||
|
||||
backend/onyx/background/celery/apps/client.py (new file, 7 lines)
@@ -0,0 +1,7 @@
from celery import Celery

import onyx.background.celery.apps.app_base as app_base

celery_app = Celery(__name__)
celery_app.config_from_object("onyx.background.celery.configs.client")
celery_app.Task = app_base.TenantAwareTask  # type: ignore [misc]
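This app is configured purely for enqueuing work, so a caller would typically use send_task rather than importing task functions. A minimal sketch, assuming a worker elsewhere registers the task (the task name and kwargs here are hypothetical, for illustration only):

from onyx.background.celery.apps.client import celery_app

# Hypothetical task name; real task names are registered by the worker apps.
celery_app.send_task(
    "onyx.background.celery.tasks.example.cleanup",
    kwargs={"tenant_id": "public"},
)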
@@ -1,4 +1,5 @@
|
||||
import logging
|
||||
import os
|
||||
from typing import Any
|
||||
from typing import cast
|
||||
|
||||
@@ -95,7 +96,7 @@ def on_worker_init(sender: Worker, **kwargs: Any) -> None:
|
||||
app_base.wait_for_db(sender, **kwargs)
|
||||
app_base.wait_for_vespa_or_shutdown(sender, **kwargs)
|
||||
|
||||
logger.info("Running as the primary celery worker.")
|
||||
logger.info(f"Running as the primary celery worker: pid={os.getpid()}")
|
||||
|
||||
# Less startup checks in multi-tenant case
|
||||
if MULTI_TENANT:
|
||||
|
||||
backend/onyx/background/celery/configs/client.py (new file, 16 lines)
@@ -0,0 +1,16 @@
import onyx.background.celery.configs.base as shared_config

broker_url = shared_config.broker_url
broker_connection_retry_on_startup = shared_config.broker_connection_retry_on_startup
broker_pool_limit = shared_config.broker_pool_limit
broker_transport_options = shared_config.broker_transport_options

redis_socket_keepalive = shared_config.redis_socket_keepalive
redis_retry_on_timeout = shared_config.redis_retry_on_timeout
redis_backend_health_check_interval = shared_config.redis_backend_health_check_interval

result_backend = shared_config.result_backend
result_expires = shared_config.result_expires  # 86400 seconds is the default

task_default_priority = shared_config.task_default_priority
task_acks_late = shared_config.task_acks_late
backend/onyx/background/celery/versioned_apps/client.py (new file, 20 lines)
@@ -0,0 +1,20 @@
"""Factory stub for running celery worker / celery beat.
This code is different from the primary/beat stubs because there is no EE version to
fetch. Port over the code in those files if we add an EE version of this worker.

This is an app stub purely for sending tasks as a client.
"""
from celery import Celery

from onyx.utils.variable_functionality import set_is_ee_based_on_env_variable

set_is_ee_based_on_env_variable()


def get_app() -> Celery:
    from onyx.background.celery.apps.client import celery_app

    return celery_app


app = get_app()
@@ -10,6 +10,7 @@ from onyx.agents.agent_search.models import GraphPersistence
|
||||
from onyx.agents.agent_search.models import GraphSearchConfig
|
||||
from onyx.agents.agent_search.models import GraphTooling
|
||||
from onyx.agents.agent_search.run_graph import run_basic_graph
|
||||
from onyx.agents.agent_search.run_graph import run_dc_graph
|
||||
from onyx.agents.agent_search.run_graph import run_main_graph
|
||||
from onyx.chat.models import AgentAnswerPiece
|
||||
from onyx.chat.models import AnswerPacket
|
||||
@@ -142,11 +143,18 @@ class Answer:
|
||||
yield from self._processed_stream
|
||||
return
|
||||
|
||||
run_langgraph = (
|
||||
run_main_graph
|
||||
if self.graph_config.behavior.use_agentic_search
|
||||
else run_basic_graph
|
||||
)
|
||||
if self.graph_config.behavior.use_agentic_search:
|
||||
run_langgraph = run_main_graph
|
||||
elif (
|
||||
self.graph_config.inputs.search_request.persona
|
||||
and self.graph_config.inputs.search_request.persona.description.startswith(
|
||||
"DivCon Beta Agent"
|
||||
)
|
||||
):
|
||||
run_langgraph = run_dc_graph
|
||||
else:
|
||||
run_langgraph = run_basic_graph
|
||||
|
||||
stream = run_langgraph(
|
||||
self.graph_config,
|
||||
)
|
||||
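Condensed sketch of the runner selection added above: agentic search takes precedence, then the "DivCon Beta Agent" persona-description prefix routes to the DC graph, otherwise the basic graph runs. The helper below is illustrative only and returns names instead of the real callables:

def pick_graph_runner(use_agentic_search: bool, persona_description: str | None) -> str:
    # Mirrors the if/elif/else added to the Answer class.
    if use_agentic_search:
        return "run_main_graph"
    if persona_description is not None and persona_description.startswith("DivCon Beta Agent"):
        return "run_dc_graph"
    return "run_basic_graph"

assert pick_graph_runner(False, "DivCon Beta Agent - account analysis") == "run_dc_graph"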
|
||||
@@ -43,6 +43,7 @@ from onyx.chat.prompt_builder.answer_prompt_builder import default_build_user_me
|
||||
from onyx.configs.chat_configs import CHAT_TARGET_CHUNK_PERCENTAGE
|
||||
from onyx.configs.chat_configs import DISABLE_LLM_CHOOSE_SEARCH
|
||||
from onyx.configs.chat_configs import MAX_CHUNKS_FED_TO_CHAT
|
||||
from onyx.configs.chat_configs import SELECTED_SECTIONS_MAX_WINDOW_PERCENTAGE
|
||||
from onyx.configs.constants import AGENT_SEARCH_INITIAL_KEY
|
||||
from onyx.configs.constants import BASIC_KEY
|
||||
from onyx.configs.constants import MessageType
|
||||
@@ -692,8 +693,13 @@ def stream_chat_message_objects(
|
||||
doc_identifiers=identifier_tuples,
|
||||
document_index=document_index,
|
||||
)
|
||||
|
||||
# Add a maximum context size in the case of user-selected docs to prevent
|
||||
# slight inaccuracies in context window size pruning from causing
|
||||
# the entire query to fail
|
||||
document_pruning_config = DocumentPruningConfig(
|
||||
is_manually_selected_docs=True
|
||||
is_manually_selected_docs=True,
|
||||
max_window_percentage=SELECTED_SECTIONS_MAX_WINDOW_PERCENTAGE,
|
||||
)
|
||||
|
||||
# In case the search doc is deleted, just don't include it
|
||||
|
||||
@@ -312,11 +312,14 @@ def prune_sections(
|
||||
)
|
||||
|
||||
|
||||
def _merge_doc_chunks(chunks: list[InferenceChunk]) -> InferenceSection:
|
||||
def _merge_doc_chunks(chunks: list[InferenceChunk]) -> tuple[InferenceSection, int]:
|
||||
assert (
|
||||
len(set([chunk.document_id for chunk in chunks])) == 1
|
||||
), "One distinct document must be passed into merge_doc_chunks"
|
||||
|
||||
ADJACENT_CHUNK_SEP = "\n"
|
||||
DISTANT_CHUNK_SEP = "\n\n...\n\n"
|
||||
|
||||
# Assuming there are no duplicates by this point
|
||||
sorted_chunks = sorted(chunks, key=lambda x: x.chunk_id)
|
||||
|
||||
@@ -324,33 +327,48 @@ def _merge_doc_chunks(chunks: list[InferenceChunk]) -> InferenceSection:
|
||||
chunks, key=lambda x: x.score if x.score is not None else float("-inf")
|
||||
)
|
||||
|
||||
added_chars = 0
|
||||
merged_content = []
|
||||
for i, chunk in enumerate(sorted_chunks):
|
||||
if i > 0:
|
||||
prev_chunk_id = sorted_chunks[i - 1].chunk_id
|
||||
if chunk.chunk_id == prev_chunk_id + 1:
|
||||
merged_content.append("\n")
|
||||
else:
|
||||
merged_content.append("\n\n...\n\n")
|
||||
sep = (
|
||||
ADJACENT_CHUNK_SEP
|
||||
if chunk.chunk_id == prev_chunk_id + 1
|
||||
else DISTANT_CHUNK_SEP
|
||||
)
|
||||
merged_content.append(sep)
|
||||
added_chars += len(sep)
|
||||
merged_content.append(chunk.content)
|
||||
|
||||
combined_content = "".join(merged_content)
|
||||
|
||||
return InferenceSection(
|
||||
center_chunk=center_chunk,
|
||||
chunks=sorted_chunks,
|
||||
combined_content=combined_content,
|
||||
return (
|
||||
InferenceSection(
|
||||
center_chunk=center_chunk,
|
||||
chunks=sorted_chunks,
|
||||
combined_content=combined_content,
|
||||
),
|
||||
added_chars,
|
||||
)
|
||||
|
||||
|
||||
def _merge_sections(sections: list[InferenceSection]) -> list[InferenceSection]:
|
||||
docs_map: dict[str, dict[int, InferenceChunk]] = defaultdict(dict)
|
||||
doc_order: dict[str, int] = {}
|
||||
combined_section_lengths: dict[str, int] = defaultdict(lambda: 0)
|
||||
|
||||
# chunk de-duping and doc ordering
|
||||
for index, section in enumerate(sections):
|
||||
if section.center_chunk.document_id not in doc_order:
|
||||
doc_order[section.center_chunk.document_id] = index
|
||||
|
||||
combined_section_lengths[section.center_chunk.document_id] += len(
|
||||
section.combined_content
|
||||
)
|
||||
|
||||
chunks_map = docs_map[section.center_chunk.document_id]
|
||||
for chunk in [section.center_chunk] + section.chunks:
|
||||
chunks_map = docs_map[section.center_chunk.document_id]
|
||||
existing_chunk = chunks_map.get(chunk.chunk_id)
|
||||
if (
|
||||
existing_chunk is None
|
||||
@@ -361,8 +379,22 @@ def _merge_sections(sections: list[InferenceSection]) -> list[InferenceSection]:
|
||||
chunks_map[chunk.chunk_id] = chunk
|
||||
|
||||
new_sections = []
|
||||
for section_chunks in docs_map.values():
|
||||
new_sections.append(_merge_doc_chunks(chunks=list(section_chunks.values())))
|
||||
for doc_id, section_chunks in docs_map.items():
|
||||
section_chunks_list = list(section_chunks.values())
|
||||
merged_section, added_chars = _merge_doc_chunks(chunks=section_chunks_list)
|
||||
|
||||
previous_length = combined_section_lengths[doc_id] + added_chars
|
||||
# After merging, ensure the content respects the pruning done earlier. Each
|
||||
# combined section is restricted to the sum of the lengths of the sections
|
||||
# from the pruning step. Technically the correct approach would be to prune based
|
||||
# on tokens AGAIN, but this is a good approximation and worth not adding the
|
||||
# tokenization overhead. This could also be fixed if we added a way of removing
|
||||
# chunks from sections in the pruning step; at the moment this issue largely
|
||||
# exists because we only trim the final section's combined_content.
|
||||
merged_section.combined_content = merged_section.combined_content[
|
||||
:previous_length
|
||||
]
|
||||
new_sections.append(merged_section)
|
||||
|
||||
# Sort by highest score, then by original document order
|
||||
# It is now 1 large section per doc, the center chunk being the one with the highest score
|
||||
|
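The added_chars value returned by _merge_doc_chunks counts only the separator characters inserted between chunks, so the cap applied here is the sum of the pruned sections' combined_content lengths plus those separators. A small worked example with illustrative numbers:

# Two pruned sections from one document: 400 and 250 characters of combined_content.
combined_section_length = 400 + 250        # tracked per document during the pruning pass
added_chars = len("\n\n...\n\n")           # 7: separator used between non-adjacent chunks
previous_length = combined_section_length + added_chars

merged = "x" * 900                         # stand-in for the re-merged combined_content
capped = merged[:previous_length]          # trimmed back to 657 characters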
||||
@@ -16,6 +16,9 @@ MAX_CHUNKS_FED_TO_CHAT = float(os.environ.get("MAX_CHUNKS_FED_TO_CHAT") or 10.0)
|
||||
# ~3k input, half for docs, half for chat history + prompts
|
||||
CHAT_TARGET_CHUNK_PERCENTAGE = 512 * 3 / 3072
|
||||
|
||||
# Maximum percentage of the context window to fill with selected sections
|
||||
SELECTED_SECTIONS_MAX_WINDOW_PERCENTAGE = 0.8
|
||||
|
||||
# 1 / (1 + DOC_TIME_DECAY * doc-age-in-years), set to 0 to have no decay
|
||||
# Capped in Vespa at 0.5
|
||||
DOC_TIME_DECAY = float(
|
||||
|
||||
@@ -13,6 +13,7 @@ from typing import TYPE_CHECKING
|
||||
from typing import TypeVar
|
||||
from urllib.parse import parse_qs
|
||||
from urllib.parse import quote
|
||||
from urllib.parse import urljoin
|
||||
from urllib.parse import urlparse
|
||||
|
||||
import requests
|
||||
@@ -342,9 +343,14 @@ def build_confluence_document_id(
|
||||
Returns:
|
||||
str: The document id
|
||||
"""
|
||||
if is_cloud and not base_url.endswith("/wiki"):
|
||||
base_url += "/wiki"
|
||||
return f"{base_url}{content_url}"
|
||||
|
||||
# NOTE: urljoin is tricky and will drop the last segment of the base if it doesn't
|
||||
# end with "/" because it believes that makes it a file.
|
||||
final_url = base_url.rstrip("/") + "/"
|
||||
if is_cloud and not final_url.endswith("/wiki/"):
|
||||
final_url = urljoin(final_url, "wiki") + "/"
|
||||
final_url = urljoin(final_url, content_url.lstrip("/"))
|
||||
return final_url
|
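The NOTE above is standard urllib behavior: without a trailing slash, urljoin treats the last path segment of the base as a file and drops it. A quick sketch (the hostname is hypothetical):

from urllib.parse import urljoin

urljoin("https://example.atlassian.net/wiki", "spaces/ENG/pages/1")
# -> "https://example.atlassian.net/spaces/ENG/pages/1"  ("wiki" is dropped)

urljoin("https://example.atlassian.net/wiki/", "spaces/ENG/pages/1")
# -> "https://example.atlassian.net/wiki/spaces/ENG/pages/1"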
||||
|
||||
|
||||
def datetime_from_string(datetime_string: str) -> datetime:
|
||||
@@ -454,6 +460,19 @@ def _handle_http_error(e: requests.HTTPError, attempt: int) -> int:
|
||||
logger.warning("HTTPError with `None` as response or as headers")
|
||||
raise e
|
||||
|
||||
# Confluence Server returns 403 when rate limited
|
||||
if e.response.status_code == 403:
|
||||
FORBIDDEN_MAX_RETRY_ATTEMPTS = 7
|
||||
FORBIDDEN_RETRY_DELAY = 10
|
||||
if attempt < FORBIDDEN_MAX_RETRY_ATTEMPTS:
|
||||
logger.warning(
|
||||
"403 error. This sometimes happens when we hit "
|
||||
f"Confluence rate limits. Retrying in {FORBIDDEN_RETRY_DELAY} seconds..."
|
||||
)
|
||||
return FORBIDDEN_RETRY_DELAY
|
||||
|
||||
raise e
|
||||
|
||||
if (
|
||||
e.response.status_code != 429
|
||||
and RATE_LIMIT_MESSAGE_LOWERCASE not in e.response.text.lower()
|
||||
|
||||
@@ -1,4 +1,5 @@
|
||||
import base64
|
||||
import time
|
||||
from collections.abc import Generator
|
||||
from datetime import datetime
|
||||
from datetime import timedelta
|
||||
@@ -7,6 +8,8 @@ from typing import Any
|
||||
from typing import cast
|
||||
|
||||
import requests
|
||||
from requests.adapters import HTTPAdapter
|
||||
from urllib3.util import Retry
|
||||
|
||||
from onyx.configs.app_configs import CONTINUE_ON_CONNECTOR_FAILURE
|
||||
from onyx.configs.app_configs import GONG_CONNECTOR_START_TIME
|
||||
@@ -21,13 +24,14 @@ from onyx.connectors.models import Document
|
||||
from onyx.connectors.models import TextSection
|
||||
from onyx.utils.logger import setup_logger
|
||||
|
||||
|
||||
logger = setup_logger()
|
||||
|
||||
GONG_BASE_URL = "https://us-34014.api.gong.io"
|
||||
|
||||
|
||||
class GongConnector(LoadConnector, PollConnector):
|
||||
BASE_URL = "https://api.gong.io"
|
||||
MAX_CALL_DETAILS_ATTEMPTS = 6
|
||||
CALL_DETAILS_DELAY = 30 # in seconds
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
workspaces: list[str] | None = None,
|
||||
@@ -41,15 +45,23 @@ class GongConnector(LoadConnector, PollConnector):
|
||||
self.auth_token_basic: str | None = None
|
||||
self.hide_user_info = hide_user_info
|
||||
|
||||
def _get_auth_header(self) -> dict[str, str]:
|
||||
if self.auth_token_basic is None:
|
||||
raise ConnectorMissingCredentialError("Gong")
|
||||
retry_strategy = Retry(
|
||||
total=5,
|
||||
backoff_factor=2,
|
||||
status_forcelist=[429, 500, 502, 503, 504],
|
||||
)
|
||||
|
||||
return {"Authorization": f"Basic {self.auth_token_basic}"}
|
||||
session = requests.Session()
|
||||
session.mount(GongConnector.BASE_URL, HTTPAdapter(max_retries=retry_strategy))
|
||||
self._session = session
|
||||
|
||||
@staticmethod
|
||||
def make_url(endpoint: str) -> str:
|
||||
url = f"{GongConnector.BASE_URL}{endpoint}"
|
||||
return url
|
||||
|
||||
def _get_workspace_id_map(self) -> dict[str, str]:
|
||||
url = f"{GONG_BASE_URL}/v2/workspaces"
|
||||
response = requests.get(url, headers=self._get_auth_header())
|
||||
response = self._session.get(GongConnector.make_url("/v2/workspaces"))
|
||||
response.raise_for_status()
|
||||
|
||||
workspaces_details = response.json().get("workspaces")
|
||||
@@ -66,7 +78,6 @@ class GongConnector(LoadConnector, PollConnector):
|
||||
def _get_transcript_batches(
|
||||
self, start_datetime: str | None = None, end_datetime: str | None = None
|
||||
) -> Generator[list[dict[str, Any]], None, None]:
|
||||
url = f"{GONG_BASE_URL}/v2/calls/transcript"
|
||||
body: dict[str, dict] = {"filter": {}}
|
||||
if start_datetime:
|
||||
body["filter"]["fromDateTime"] = start_datetime
|
||||
@@ -94,8 +105,8 @@ class GongConnector(LoadConnector, PollConnector):
|
||||
del body["filter"]["workspaceId"]
|
||||
|
||||
while True:
|
||||
response = requests.post(
|
||||
url, headers=self._get_auth_header(), json=body
|
||||
response = self._session.post(
|
||||
GongConnector.make_url("/v2/calls/transcript"), json=body
|
||||
)
|
||||
# If no calls in the range, just break out
|
||||
if response.status_code == 404:
|
||||
@@ -125,14 +136,14 @@ class GongConnector(LoadConnector, PollConnector):
|
||||
yield transcripts
|
||||
|
||||
def _get_call_details_by_ids(self, call_ids: list[str]) -> dict:
|
||||
url = f"{GONG_BASE_URL}/v2/calls/extensive"
|
||||
|
||||
body = {
|
||||
"filter": {"callIds": call_ids},
|
||||
"contentSelector": {"exposedFields": {"parties": True}},
|
||||
}
|
||||
|
||||
response = requests.post(url, headers=self._get_auth_header(), json=body)
|
||||
response = self._session.post(
|
||||
GongConnector.make_url("/v2/calls/extensive"), json=body
|
||||
)
|
||||
response.raise_for_status()
|
||||
|
||||
calls = response.json().get("calls")
|
||||
@@ -165,24 +176,74 @@ class GongConnector(LoadConnector, PollConnector):
|
||||
def _fetch_calls(
|
||||
self, start_datetime: str | None = None, end_datetime: str | None = None
|
||||
) -> GenerateDocumentsOutput:
|
||||
num_calls = 0
|
||||
|
||||
for transcript_batch in self._get_transcript_batches(
|
||||
start_datetime, end_datetime
|
||||
):
|
||||
doc_batch: list[Document] = []
|
||||
|
||||
call_ids = cast(
|
||||
transcript_call_ids = cast(
|
||||
list[str],
|
||||
[t.get("callId") for t in transcript_batch if t.get("callId")],
|
||||
)
|
||||
call_details_map = self._get_call_details_by_ids(call_ids)
|
||||
|
||||
call_details_map: dict[str, Any] = {}
|
||||
|
||||
# There's a likely race condition in the API where a transcript will have a
|
||||
# call id but the call to v2/calls/extensive will not return all of the id's
|
||||
# retry with exponential backoff has been observed to mitigate this
|
||||
# in ~2 minutes
|
||||
current_attempt = 0
|
||||
while True:
|
||||
current_attempt += 1
|
||||
call_details_map = self._get_call_details_by_ids(transcript_call_ids)
|
||||
if set(transcript_call_ids) == set(call_details_map.keys()):
|
||||
# we got all the id's we were expecting ... break and continue
|
||||
break
|
||||
|
||||
# we are missing some id's. Log and retry with exponential backoff
|
||||
missing_call_ids = set(transcript_call_ids) - set(
|
||||
call_details_map.keys()
|
||||
)
|
||||
logger.warning(
|
||||
f"_get_call_details_by_ids is missing call id's: "
|
||||
f"current_attempt={current_attempt} "
|
||||
f"missing_call_ids={missing_call_ids}"
|
||||
)
|
||||
if current_attempt >= self.MAX_CALL_DETAILS_ATTEMPTS:
|
||||
raise RuntimeError(
|
||||
f"Attempt count exceeded for _get_call_details_by_ids: "
|
||||
f"missing_call_ids={missing_call_ids} "
|
||||
f"max_attempts={self.MAX_CALL_DETAILS_ATTEMPTS}"
|
||||
)
|
||||
|
||||
wait_seconds = self.CALL_DETAILS_DELAY * pow(2, current_attempt - 1)
|
||||
logger.warning(
|
||||
f"_get_call_details_by_ids waiting to retry: "
|
||||
f"wait={wait_seconds}s "
|
||||
f"current_attempt={current_attempt} "
|
||||
f"next_attempt={current_attempt+1} "
|
||||
f"max_attempts={self.MAX_CALL_DETAILS_ATTEMPTS}"
|
||||
)
|
||||
time.sleep(wait_seconds)
|
||||
|
||||
# now we can iterate per call/transcript
|
||||
for transcript in transcript_batch:
|
||||
call_id = transcript.get("callId")
|
||||
|
||||
if not call_id or call_id not in call_details_map:
|
||||
# NOTE(rkuo): seeing odd behavior where call_ids from the transcript
|
||||
# don't have call details. adding error debugging logs to trace.
|
||||
logger.error(
|
||||
f"Couldn't get call information for Call ID: {call_id}"
|
||||
)
|
||||
if call_id:
|
||||
logger.error(
|
||||
f"Call debug info: call_id={call_id} "
|
||||
f"call_ids={transcript_call_ids} "
|
||||
f"call_details_map={call_details_map.keys()}"
|
||||
)
|
||||
if not self.continue_on_fail:
|
||||
raise RuntimeError(
|
||||
f"Couldn't get call information for Call ID: {call_id}"
|
||||
@@ -195,7 +256,8 @@ class GongConnector(LoadConnector, PollConnector):
|
||||
call_time_str = call_metadata["started"]
|
||||
call_title = call_metadata["title"]
|
||||
logger.info(
|
||||
f"Indexing Gong call from {call_time_str.split('T', 1)[0]}: {call_title}"
|
||||
f"{num_calls+1}: Indexing Gong call id {call_id} "
|
||||
f"from {call_time_str.split('T', 1)[0]}: {call_title}"
|
||||
)
|
||||
|
||||
call_parties = cast(list[dict] | None, call_details.get("parties"))
|
||||
@@ -254,8 +316,13 @@ class GongConnector(LoadConnector, PollConnector):
|
||||
metadata={"client": call_metadata.get("system")},
|
||||
)
|
||||
)
|
||||
|
||||
num_calls += 1
|
||||
|
||||
yield doc_batch
|
||||
|
||||
logger.info(f"_fetch_calls finished: num_calls={num_calls}")
|
||||
|
||||
def load_credentials(self, credentials: dict[str, Any]) -> dict[str, Any] | None:
|
||||
combined = (
|
||||
f'{credentials["gong_access_key"]}:{credentials["gong_access_key_secret"]}'
|
||||
@@ -263,6 +330,13 @@ class GongConnector(LoadConnector, PollConnector):
|
||||
self.auth_token_basic = base64.b64encode(combined.encode("utf-8")).decode(
|
||||
"utf-8"
|
||||
)
|
||||
|
||||
if self.auth_token_basic is None:
|
||||
raise ConnectorMissingCredentialError("Gong")
|
||||
|
||||
self._session.headers.update(
|
||||
{"Authorization": f"Basic {self.auth_token_basic}"}
|
||||
)
|
||||
return None
|
||||
|
||||
def load_from_state(self) -> GenerateDocumentsOutput:
|
||||
|
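For reference, the retry loop around _get_call_details_by_ids sleeps CALL_DETAILS_DELAY * 2 ** (attempt - 1) seconds after each incomplete response, and the sixth failed attempt raises instead of sleeping. A small sketch of the resulting schedule:

CALL_DETAILS_DELAY = 30
MAX_CALL_DETAILS_ATTEMPTS = 6

waits = [
    CALL_DETAILS_DELAY * 2 ** (attempt - 1)
    for attempt in range(1, MAX_CALL_DETAILS_ATTEMPTS)
]
# [30, 60, 120, 240, 480] seconds -> up to ~15.5 minutes of waiting before the RuntimeError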
||||
@@ -445,6 +445,9 @@ class GoogleDriveConnector(SlimConnector, CheckpointConnector[GoogleDriveCheckpo
|
||||
logger.warning(
|
||||
f"User '{user_email}' does not have access to the drive APIs."
|
||||
)
|
||||
# mark this user as done so we don't try to retrieve anything for them
|
||||
# again
|
||||
curr_stage.stage = DriveRetrievalStage.DONE
|
||||
return
|
||||
raise
|
||||
|
||||
@@ -581,6 +584,25 @@ class GoogleDriveConnector(SlimConnector, CheckpointConnector[GoogleDriveCheckpo
|
||||
drive_ids_to_retrieve, checkpoint
|
||||
)
|
||||
|
||||
# only process emails that we haven't already completed retrieval for
|
||||
non_completed_org_emails = [
|
||||
user_email
|
||||
for user_email, stage in checkpoint.completion_map.items()
|
||||
if stage != DriveRetrievalStage.DONE
|
||||
]
|
||||
|
||||
# don't process too many emails before returning a checkpoint. This is
|
||||
# to resolve the case where there are a ton of emails that don't have access
|
||||
# to the drive APIs. Without this, we could loop through these emails for
|
||||
# more than 3 hours, causing a timeout and stalling progress.
|
||||
email_batch_takes_us_to_completion = True
|
||||
MAX_EMAILS_TO_PROCESS_BEFORE_CHECKPOINTING = 50
|
||||
if len(non_completed_org_emails) > MAX_EMAILS_TO_PROCESS_BEFORE_CHECKPOINTING:
|
||||
non_completed_org_emails = non_completed_org_emails[
|
||||
:MAX_EMAILS_TO_PROCESS_BEFORE_CHECKPOINTING
|
||||
]
|
||||
email_batch_takes_us_to_completion = False
|
||||
|
||||
user_retrieval_gens = [
|
||||
self._impersonate_user_for_retrieval(
|
||||
email,
|
||||
@@ -591,10 +613,14 @@ class GoogleDriveConnector(SlimConnector, CheckpointConnector[GoogleDriveCheckpo
|
||||
start,
|
||||
end,
|
||||
)
|
||||
for email in all_org_emails
|
||||
for email in non_completed_org_emails
|
||||
]
|
||||
yield from parallel_yield(user_retrieval_gens, max_workers=MAX_DRIVE_WORKERS)
|
||||
|
||||
# if there are more emails to process, don't mark as complete
|
||||
if not email_batch_takes_us_to_completion:
|
||||
return
|
||||
|
||||
remaining_folders = (
|
||||
drive_ids_to_retrieve | folder_ids_to_retrieve
|
||||
) - self._retrieved_ids
|
||||
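A minimal, self-contained sketch of the email batching decision above (stand-in data; the real code reads checkpoint.completion_map and DriveRetrievalStage):

MAX_EMAILS_TO_PROCESS_BEFORE_CHECKPOINTING = 50

completion_map = {f"user{i}@example.com": "in_progress" for i in range(120)}  # stand-in
non_completed = [email for email, stage in completion_map.items() if stage != "done"]

email_batch_takes_us_to_completion = (
    len(non_completed) <= MAX_EMAILS_TO_PROCESS_BEFORE_CHECKPOINTING
)
batch = non_completed[:MAX_EMAILS_TO_PROCESS_BEFORE_CHECKPOINTING]
# process batch; when the flag is False, return early so the checkpoint is saved
# and the remaining users are picked up on the next run instead of timing out here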
|
||||
@@ -20,7 +20,8 @@ from onyx.connectors.models import ConnectorMissingCredentialError
|
||||
from onyx.connectors.models import Document
|
||||
from onyx.connectors.models import SlimDocument
|
||||
from onyx.connectors.models import TextSection
|
||||
from onyx.file_processing.extract_file_text import ALL_ACCEPTED_FILE_EXTENSIONS
|
||||
from onyx.file_processing.extract_file_text import ACCEPTED_DOCUMENT_FILE_EXTENSIONS
|
||||
from onyx.file_processing.extract_file_text import ACCEPTED_PLAIN_TEXT_FILE_EXTENSIONS
|
||||
from onyx.file_processing.extract_file_text import extract_file_text
|
||||
from onyx.indexing.indexing_heartbeat import IndexingHeartbeatInterface
|
||||
from onyx.utils.logger import setup_logger
|
||||
@@ -84,14 +85,21 @@ class HighspotConnector(LoadConnector, PollConnector, SlimConnector):
|
||||
Populate the spot ID map with all available spots.
|
||||
Keys are stored as lowercase for case-insensitive lookups.
|
||||
"""
|
||||
spots = self.client.get_spots()
|
||||
for spot in spots:
|
||||
if "title" in spot and "id" in spot:
|
||||
spot_name = spot["title"]
|
||||
self._spot_id_map[spot_name.lower()] = spot["id"]
|
||||
try:
|
||||
spots = self.client.get_spots()
|
||||
for spot in spots:
|
||||
if "title" in spot and "id" in spot:
|
||||
spot_name = spot["title"]
|
||||
self._spot_id_map[spot_name.lower()] = spot["id"]
|
||||
|
||||
self._all_spots_fetched = True
|
||||
logger.info(f"Retrieved {len(self._spot_id_map)} spots from Highspot")
|
||||
self._all_spots_fetched = True
|
||||
logger.info(f"Retrieved {len(self._spot_id_map)} spots from Highspot")
|
||||
except HighspotClientError as e:
|
||||
logger.error(f"Error retrieving spots from Highspot: {str(e)}")
|
||||
raise
|
||||
except Exception as e:
|
||||
logger.error(f"Unexpected error retrieving spots from Highspot: {str(e)}")
|
||||
raise
|
||||
|
||||
def _get_all_spot_names(self) -> List[str]:
|
||||
"""
|
||||
@@ -151,116 +159,142 @@ class HighspotConnector(LoadConnector, PollConnector, SlimConnector):
|
||||
Batches of Document objects
|
||||
"""
|
||||
doc_batch: list[Document] = []
|
||||
try:
|
||||
# If no spots specified, get all spots
|
||||
spot_names_to_process = self.spot_names
|
||||
if not spot_names_to_process:
|
||||
spot_names_to_process = self._get_all_spot_names()
|
||||
if not spot_names_to_process:
|
||||
logger.warning("No spots found in Highspot")
|
||||
raise ValueError("No spots found in Highspot")
|
||||
logger.info(
|
||||
f"No spots specified, using all {len(spot_names_to_process)} available spots"
|
||||
)
|
||||
|
||||
# If no spots specified, get all spots
|
||||
spot_names_to_process = self.spot_names
|
||||
if not spot_names_to_process:
|
||||
spot_names_to_process = self._get_all_spot_names()
|
||||
logger.info(
|
||||
f"No spots specified, using all {len(spot_names_to_process)} available spots"
|
||||
)
|
||||
|
||||
for spot_name in spot_names_to_process:
|
||||
try:
|
||||
spot_id = self._get_spot_id_from_name(spot_name)
|
||||
if spot_id is None:
|
||||
logger.warning(f"Spot ID not found for spot {spot_name}")
|
||||
continue
|
||||
offset = 0
|
||||
has_more = True
|
||||
|
||||
while has_more:
|
||||
logger.info(
|
||||
f"Retrieving items from spot {spot_name}, offset {offset}"
|
||||
)
|
||||
response = self.client.get_spot_items(
|
||||
spot_id=spot_id, offset=offset, page_size=self.batch_size
|
||||
)
|
||||
items = response.get("collection", [])
|
||||
logger.info(f"Received Items: {items}")
|
||||
if not items:
|
||||
has_more = False
|
||||
for spot_name in spot_names_to_process:
|
||||
try:
|
||||
spot_id = self._get_spot_id_from_name(spot_name)
|
||||
if spot_id is None:
|
||||
logger.warning(f"Spot ID not found for spot {spot_name}")
|
||||
continue
|
||||
offset = 0
|
||||
has_more = True
|
||||
|
||||
for item in items:
|
||||
try:
|
||||
item_id = item.get("id")
|
||||
if not item_id:
|
||||
logger.warning("Item without ID found, skipping")
|
||||
continue
|
||||
while has_more:
|
||||
logger.info(
|
||||
f"Retrieving items from spot {spot_name}, offset {offset}"
|
||||
)
|
||||
response = self.client.get_spot_items(
|
||||
spot_id=spot_id, offset=offset, page_size=self.batch_size
|
||||
)
|
||||
items = response.get("collection", [])
|
||||
logger.info(f"Received Items: {items}")
|
||||
if not items:
|
||||
has_more = False
|
||||
continue
|
||||
|
||||
item_details = self.client.get_item(item_id)
|
||||
if not item_details:
|
||||
logger.warning(
|
||||
f"Item {item_id} details not found, skipping"
|
||||
)
|
||||
continue
|
||||
# Apply time filter if specified
|
||||
if start or end:
|
||||
updated_at = item_details.get("date_updated")
|
||||
if updated_at:
|
||||
# Convert to datetime for comparison
|
||||
try:
|
||||
updated_time = datetime.fromisoformat(
|
||||
updated_at.replace("Z", "+00:00")
|
||||
)
|
||||
if (
|
||||
start and updated_time.timestamp() < start
|
||||
) or (end and updated_time.timestamp() > end):
|
||||
for item in items:
|
||||
try:
|
||||
item_id = item.get("id")
|
||||
if not item_id:
|
||||
logger.warning("Item without ID found, skipping")
|
||||
continue
|
||||
|
||||
item_details = self.client.get_item(item_id)
|
||||
if not item_details:
|
||||
logger.warning(
|
||||
f"Item {item_id} details not found, skipping"
|
||||
)
|
||||
continue
|
||||
# Apply time filter if specified
|
||||
if start or end:
|
||||
updated_at = item_details.get("date_updated")
|
||||
if updated_at:
|
||||
# Convert to datetime for comparison
|
||||
try:
|
||||
updated_time = datetime.fromisoformat(
|
||||
updated_at.replace("Z", "+00:00")
|
||||
)
|
||||
if (
|
||||
start
|
||||
and updated_time.timestamp() < start
|
||||
) or (
|
||||
end and updated_time.timestamp() > end
|
||||
):
|
||||
continue
|
||||
except (ValueError, TypeError):
|
||||
# Skip if date cannot be parsed
|
||||
logger.warning(
|
||||
f"Invalid date format for item {item_id}: {updated_at}"
|
||||
)
|
||||
continue
|
||||
except (ValueError, TypeError):
|
||||
# Skip if date cannot be parsed
|
||||
logger.warning(
|
||||
f"Invalid date format for item {item_id}: {updated_at}"
|
||||
)
|
||||
continue
|
||||
|
||||
content = self._get_item_content(item_details)
|
||||
title = item_details.get("title", "")
|
||||
content = self._get_item_content(item_details)
|
||||
|
||||
doc_batch.append(
|
||||
Document(
|
||||
id=f"HIGHSPOT_{item_id}",
|
||||
sections=[
|
||||
TextSection(
|
||||
link=item_details.get(
|
||||
"url",
|
||||
f"https://www.highspot.com/items/{item_id}",
|
||||
title = item_details.get("title", "")
|
||||
|
||||
doc_batch.append(
|
||||
Document(
|
||||
id=f"HIGHSPOT_{item_id}",
|
||||
sections=[
|
||||
TextSection(
|
||||
link=item_details.get(
|
||||
"url",
|
||||
f"https://www.highspot.com/items/{item_id}",
|
||||
),
|
||||
text=content,
|
||||
)
|
||||
],
|
||||
source=DocumentSource.HIGHSPOT,
|
||||
semantic_identifier=title,
|
||||
metadata={
|
||||
"spot_name": spot_name,
|
||||
"type": item_details.get(
|
||||
"content_type", ""
|
||||
),
|
||||
text=content,
|
||||
)
|
||||
],
|
||||
source=DocumentSource.HIGHSPOT,
|
||||
semantic_identifier=title,
|
||||
metadata={
|
||||
"spot_name": spot_name,
|
||||
"type": item_details.get("content_type", ""),
|
||||
"created_at": item_details.get(
|
||||
"date_added", ""
|
||||
),
|
||||
"author": item_details.get("author", ""),
|
||||
"language": item_details.get("language", ""),
|
||||
"can_download": str(
|
||||
item_details.get("can_download", False)
|
||||
),
|
||||
},
|
||||
doc_updated_at=item_details.get("date_updated"),
|
||||
"created_at": item_details.get(
|
||||
"date_added", ""
|
||||
),
|
||||
"author": item_details.get("author", ""),
|
||||
"language": item_details.get(
|
||||
"language", ""
|
||||
),
|
||||
"can_download": str(
|
||||
item_details.get("can_download", False)
|
||||
),
|
||||
},
|
||||
doc_updated_at=item_details.get("date_updated"),
|
||||
)
|
||||
)
|
||||
)
|
||||
|
||||
if len(doc_batch) >= self.batch_size:
|
||||
yield doc_batch
|
||||
doc_batch = []
|
||||
if len(doc_batch) >= self.batch_size:
|
||||
yield doc_batch
|
||||
doc_batch = []
|
||||
|
||||
except HighspotClientError as e:
|
||||
item_id = "ID" if not item_id else item_id
|
||||
logger.error(f"Error retrieving item {item_id}: {str(e)}")
|
||||
except HighspotClientError as e:
|
||||
item_id = "ID" if not item_id else item_id
|
||||
logger.error(
|
||||
f"Error retrieving item {item_id}: {str(e)}"
|
||||
)
|
||||
except Exception as e:
|
||||
item_id = "ID" if not item_id else item_id
|
||||
logger.error(
|
||||
f"Unexpected error for item {item_id}: {str(e)}"
|
||||
)
|
||||
|
||||
has_more = len(items) >= self.batch_size
|
||||
offset += self.batch_size
|
||||
has_more = len(items) >= self.batch_size
|
||||
offset += self.batch_size
|
||||
|
||||
except (HighspotClientError, ValueError) as e:
|
||||
logger.error(f"Error processing spot {spot_name}: {str(e)}")
|
||||
except (HighspotClientError, ValueError) as e:
|
||||
logger.error(f"Error processing spot {spot_name}: {str(e)}")
|
||||
except Exception as e:
|
||||
logger.error(
|
||||
f"Unexpected error processing spot {spot_name}: {str(e)}"
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in Highspot connector: {str(e)}")
|
||||
raise
|
||||
|
||||
if doc_batch:
|
||||
yield doc_batch
|
||||
@@ -286,7 +320,9 @@ class HighspotConnector(LoadConnector, PollConnector, SlimConnector):
|
||||
# Extract title and description once at the beginning
|
||||
title, description = self._extract_title_and_description(item_details)
|
||||
default_content = f"{title}\n{description}"
|
||||
logger.info(f"Processing item {item_id} with extension {file_extension}")
|
||||
logger.info(
|
||||
f"Processing item {item_id} with extension {file_extension} and file name {content_name}"
|
||||
)
|
||||
|
||||
try:
|
||||
if content_type == "WebLink":
|
||||
@@ -298,30 +334,39 @@ class HighspotConnector(LoadConnector, PollConnector, SlimConnector):
|
||||
|
||||
elif (
|
||||
is_valid_format
|
||||
and file_extension in ALL_ACCEPTED_FILE_EXTENSIONS
|
||||
and (
|
||||
file_extension in ACCEPTED_PLAIN_TEXT_FILE_EXTENSIONS
|
||||
or file_extension in ACCEPTED_DOCUMENT_FILE_EXTENSIONS
|
||||
)
|
||||
and can_download
|
||||
):
|
||||
# For documents, try to get the text content
|
||||
if not item_id: # Ensure item_id is defined
|
||||
return default_content
|
||||
|
||||
content_response = self.client.get_item_content(item_id)
|
||||
# Process and extract text from binary content based on type
|
||||
if content_response:
|
||||
text_content = extract_file_text(
|
||||
BytesIO(content_response), content_name
|
||||
BytesIO(content_response), content_name, False
|
||||
)
|
||||
return text_content
|
||||
return text_content if text_content else default_content
|
||||
return default_content
|
||||
|
||||
else:
|
||||
return default_content
|
||||
|
||||
except HighspotClientError as e:
|
||||
# Use item_id safely in the warning message
|
||||
error_context = f"item {item_id}" if item_id else "item"
|
||||
error_context = f"item {item_id}" if item_id else "(item id not found)"
|
||||
logger.warning(f"Could not retrieve content for {error_context}: {str(e)}")
|
||||
return ""
|
||||
return default_content
|
||||
except ValueError as e:
|
||||
error_context = f"item {item_id}" if item_id else "(item id not found)"
|
||||
logger.error(f"Value error for {error_context}: {str(e)}")
|
||||
return default_content
|
||||
|
||||
except Exception as e:
|
||||
error_context = f"item {item_id}" if item_id else "(item id not found)"
|
||||
logger.error(
|
||||
f"Unexpected error retrieving content for {error_context}: {str(e)}"
|
||||
)
|
||||
return default_content
|
||||
|
||||
def _extract_title_and_description(
|
||||
self, item_details: Dict[str, Any]
|
||||
@@ -358,55 +403,63 @@ class HighspotConnector(LoadConnector, PollConnector, SlimConnector):
|
||||
Batches of SlimDocument objects
|
||||
"""
|
||||
slim_doc_batch: list[SlimDocument] = []
|
||||
|
||||
# If no spots specified, get all spots
|
||||
spot_names_to_process = self.spot_names
|
||||
if not spot_names_to_process:
|
||||
spot_names_to_process = self._get_all_spot_names()
|
||||
logger.info(
|
||||
f"No spots specified, using all {len(spot_names_to_process)} available spots for slim documents"
|
||||
)
|
||||
|
||||
for spot_name in spot_names_to_process:
|
||||
try:
|
||||
spot_id = self._get_spot_id_from_name(spot_name)
|
||||
offset = 0
|
||||
has_more = True
|
||||
|
||||
while has_more:
|
||||
logger.info(
|
||||
f"Retrieving slim documents from spot {spot_name}, offset {offset}"
|
||||
)
|
||||
response = self.client.get_spot_items(
|
||||
spot_id=spot_id, offset=offset, page_size=self.batch_size
|
||||
)
|
||||
|
||||
items = response.get("collection", [])
|
||||
if not items:
|
||||
has_more = False
|
||||
continue
|
||||
|
||||
for item in items:
|
||||
item_id = item.get("id")
|
||||
if not item_id:
|
||||
continue
|
||||
|
||||
slim_doc_batch.append(SlimDocument(id=f"HIGHSPOT_{item_id}"))
|
||||
|
||||
if len(slim_doc_batch) >= _SLIM_BATCH_SIZE:
|
||||
yield slim_doc_batch
|
||||
slim_doc_batch = []
|
||||
|
||||
has_more = len(items) >= self.batch_size
|
||||
offset += self.batch_size
|
||||
|
||||
except (HighspotClientError, ValueError) as e:
|
||||
logger.error(
|
||||
f"Error retrieving slim documents from spot {spot_name}: {str(e)}"
|
||||
try:
|
||||
# If no spots specified, get all spots
|
||||
spot_names_to_process = self.spot_names
|
||||
if not spot_names_to_process:
|
||||
spot_names_to_process = self._get_all_spot_names()
|
||||
if not spot_names_to_process:
|
||||
logger.warning("No spots found in Highspot")
|
||||
raise ValueError("No spots found in Highspot")
|
||||
logger.info(
|
||||
f"No spots specified, using all {len(spot_names_to_process)} available spots for slim documents"
|
||||
)
|
||||
|
||||
if slim_doc_batch:
|
||||
yield slim_doc_batch
|
||||
for spot_name in spot_names_to_process:
|
||||
try:
|
||||
spot_id = self._get_spot_id_from_name(spot_name)
|
||||
offset = 0
|
||||
has_more = True
|
||||
|
||||
while has_more:
|
||||
logger.info(
|
||||
f"Retrieving slim documents from spot {spot_name}, offset {offset}"
|
||||
)
|
||||
response = self.client.get_spot_items(
|
||||
spot_id=spot_id, offset=offset, page_size=self.batch_size
|
||||
)
|
||||
|
||||
items = response.get("collection", [])
|
||||
if not items:
|
||||
has_more = False
|
||||
continue
|
||||
|
||||
for item in items:
|
||||
item_id = item.get("id")
|
||||
if not item_id:
|
||||
continue
|
||||
|
||||
slim_doc_batch.append(
|
||||
SlimDocument(id=f"HIGHSPOT_{item_id}")
|
||||
)
|
||||
|
||||
if len(slim_doc_batch) >= _SLIM_BATCH_SIZE:
|
||||
yield slim_doc_batch
|
||||
slim_doc_batch = []
|
||||
|
||||
has_more = len(items) >= self.batch_size
|
||||
offset += self.batch_size
|
||||
|
||||
except (HighspotClientError, ValueError) as e:
|
||||
logger.error(
|
||||
f"Error retrieving slim documents from spot {spot_name}: {str(e)}"
|
||||
)
|
||||
|
||||
if slim_doc_batch:
|
||||
yield slim_doc_batch
|
||||
except Exception as e:
|
||||
logger.error(f"Error in Highspot Slim Connector: {str(e)}")
|
||||
raise
|
||||
|
||||
def validate_credentials(self) -> bool:
|
||||
"""
|
||||
|
||||
@@ -1,3 +1,4 @@
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from enum import Enum
|
||||
from typing import Any
|
||||
@@ -40,6 +41,9 @@ class TextSection(Section):
|
||||
text: str
|
||||
link: str | None = None
|
||||
|
||||
def __sizeof__(self) -> int:
|
||||
return sys.getsizeof(self.text) + sys.getsizeof(self.link)
|
||||
|
||||
|
||||
class ImageSection(Section):
|
||||
"""Section containing an image reference"""
|
||||
@@ -47,6 +51,9 @@ class ImageSection(Section):
|
||||
image_file_name: str
|
||||
link: str | None = None
|
||||
|
||||
def __sizeof__(self) -> int:
|
||||
return sys.getsizeof(self.image_file_name) + sys.getsizeof(self.link)
|
||||
|
||||
|
||||
class BasicExpertInfo(BaseModel):
|
||||
"""Basic Information for the owner of a document, any of the fields can be left as None
|
||||
@@ -110,6 +117,14 @@ class BasicExpertInfo(BaseModel):
|
||||
)
|
||||
)
|
||||
|
||||
def __sizeof__(self) -> int:
|
||||
size = sys.getsizeof(self.display_name)
|
||||
size += sys.getsizeof(self.first_name)
|
||||
size += sys.getsizeof(self.middle_initial)
|
||||
size += sys.getsizeof(self.last_name)
|
||||
size += sys.getsizeof(self.email)
|
||||
return size
|
||||
|
||||
|
||||
class DocumentBase(BaseModel):
|
||||
"""Used for Onyx ingestion api, the ID is inferred before use if not provided"""
|
||||
@@ -163,6 +178,32 @@ class DocumentBase(BaseModel):
|
||||
attributes.append(k + INDEX_SEPARATOR + v)
|
||||
return attributes
|
||||
|
||||
def __sizeof__(self) -> int:
|
||||
size = sys.getsizeof(self.id)
|
||||
for section in self.sections:
|
||||
size += sys.getsizeof(section)
|
||||
size += sys.getsizeof(self.source)
|
||||
size += sys.getsizeof(self.semantic_identifier)
|
||||
size += sys.getsizeof(self.doc_updated_at)
|
||||
size += sys.getsizeof(self.chunk_count)
|
||||
|
||||
if self.primary_owners is not None:
|
||||
for primary_owner in self.primary_owners:
|
||||
size += sys.getsizeof(primary_owner)
|
||||
else:
|
||||
size += sys.getsizeof(self.primary_owners)
|
||||
|
||||
if self.secondary_owners is not None:
|
||||
for secondary_owner in self.secondary_owners:
|
||||
size += sys.getsizeof(secondary_owner)
|
||||
else:
|
||||
size += sys.getsizeof(self.secondary_owners)
|
||||
|
||||
size += sys.getsizeof(self.title)
|
||||
size += sys.getsizeof(self.from_ingestion_api)
|
||||
size += sys.getsizeof(self.additional_info)
|
||||
return size
|
||||
|
||||
def get_text_content(self) -> str:
|
||||
return " ".join([section.text for section in self.sections if section.text])
|
||||
|
||||
@@ -194,6 +235,12 @@ class Document(DocumentBase):
|
||||
from_ingestion_api=base.from_ingestion_api,
|
||||
)
|
||||
|
||||
def __sizeof__(self) -> int:
|
||||
size = super().__sizeof__()
|
||||
size += sys.getsizeof(self.id)
|
||||
size += sys.getsizeof(self.source)
|
||||
return size
|
||||
|
||||
|
||||
class IndexingDocument(Document):
|
||||
"""Document with processed sections for indexing"""
|
||||
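These __sizeof__ overrides let callers estimate a Document's in-memory footprint with sys.getsizeof, which the Salesforce connector below uses to flush batches once they cross MAX_BATCH_BYTES. Note that sys.getsizeof reports shallow sizes, so this is an approximation, not exact accounting. A short sketch of that batching pattern (stand-in objects, illustrative thresholds):

import sys

MAX_BATCH_BYTES = 1024 * 1024
batch: list[str] = []
batch_bytes = 0

for doc in ["a" * 10_000] * 300:            # stand-ins for Document objects
    batch.append(doc)
    batch_bytes += sys.getsizeof(doc)
    if len(batch) >= 100 or batch_bytes > MAX_BATCH_BYTES:
        # yield / process the batch here, then reset the counters
        batch, batch_bytes = [], 0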
|
||||
@@ -1,4 +1,9 @@
|
||||
import gc
|
||||
import os
|
||||
import sys
|
||||
import tempfile
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from simple_salesforce import Salesforce
|
||||
@@ -21,9 +26,13 @@ from onyx.connectors.salesforce.salesforce_calls import get_all_children_of_sf_t
|
||||
from onyx.connectors.salesforce.sqlite_functions import get_affected_parent_ids_by_type
|
||||
from onyx.connectors.salesforce.sqlite_functions import get_record
|
||||
from onyx.connectors.salesforce.sqlite_functions import init_db
|
||||
from onyx.connectors.salesforce.sqlite_functions import sqlite_log_stats
|
||||
from onyx.connectors.salesforce.sqlite_functions import update_sf_db_with_csv
|
||||
from onyx.connectors.salesforce.utils import BASE_DATA_PATH
|
||||
from onyx.connectors.salesforce.utils import get_sqlite_db_path
|
||||
from onyx.indexing.indexing_heartbeat import IndexingHeartbeatInterface
|
||||
from onyx.utils.logger import setup_logger
|
||||
from shared_configs.configs import MULTI_TENANT
|
||||
|
||||
logger = setup_logger()
|
||||
|
||||
@@ -32,6 +41,8 @@ _DEFAULT_PARENT_OBJECT_TYPES = ["Account"]
|
||||
|
||||
|
||||
class SalesforceConnector(LoadConnector, PollConnector, SlimConnector):
|
||||
MAX_BATCH_BYTES = 1024 * 1024
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
batch_size: int = INDEX_BATCH_SIZE,
|
||||
@@ -64,22 +75,45 @@ class SalesforceConnector(LoadConnector, PollConnector, SlimConnector):
|
||||
raise ConnectorMissingCredentialError("Salesforce")
|
||||
return self._sf_client
|
||||
|
||||
def _fetch_from_salesforce(
|
||||
self,
|
||||
@staticmethod
|
||||
def reconstruct_object_types(directory: str) -> dict[str, list[str] | None]:
|
||||
"""
|
||||
Scans the given directory for all CSV files and reconstructs the available object types.
|
||||
Assumes filenames are formatted as "ObjectType.filename.csv" or "ObjectType.csv".
|
||||
|
||||
Args:
|
||||
directory (str): The path to the directory containing CSV files.
|
||||
|
||||
Returns:
|
||||
dict[str, list[str]]: A dictionary mapping object types to lists of file paths.
|
||||
"""
|
||||
object_types = defaultdict(list)
|
||||
|
||||
for filename in os.listdir(directory):
|
||||
if filename.endswith(".csv"):
|
||||
parts = filename.split(".", 1) # Split on the first period
|
||||
object_type = parts[0] # Take the first part as the object type
|
||||
object_types[object_type].append(os.path.join(directory, filename))
|
||||
|
||||
return dict(object_types)
|
||||
|
||||
@staticmethod
|
||||
def _download_object_csvs(
|
||||
directory: str,
|
||||
parent_object_list: list[str],
|
||||
sf_client: Salesforce,
|
||||
start: SecondsSinceUnixEpoch | None = None,
|
||||
end: SecondsSinceUnixEpoch | None = None,
|
||||
) -> GenerateDocumentsOutput:
|
||||
init_db()
|
||||
all_object_types: set[str] = set(self.parent_object_list)
|
||||
) -> None:
|
||||
all_object_types: set[str] = set(parent_object_list)
|
||||
|
||||
logger.info(f"Starting with {len(self.parent_object_list)} parent object types")
|
||||
logger.debug(f"Parent object types: {self.parent_object_list}")
|
||||
logger.info(
|
||||
f"Parent object types: num={len(parent_object_list)} list={parent_object_list}"
|
||||
)
|
||||
|
||||
# This takes like 20 seconds
|
||||
for parent_object_type in self.parent_object_list:
|
||||
child_types = get_all_children_of_sf_type(
|
||||
self.sf_client, parent_object_type
|
||||
)
|
||||
for parent_object_type in parent_object_list:
|
||||
child_types = get_all_children_of_sf_type(sf_client, parent_object_type)
|
||||
all_object_types.update(child_types)
|
||||
logger.debug(
|
||||
f"Found {len(child_types)} child types for {parent_object_type}"
|
||||
@@ -88,20 +122,53 @@ class SalesforceConnector(LoadConnector, PollConnector, SlimConnector):
|
||||
# Always want to make sure user is grabbed for permissioning purposes
|
||||
all_object_types.add("User")
|
||||
|
||||
logger.info(f"Found total of {len(all_object_types)} object types to fetch")
|
||||
logger.debug(f"All object types: {all_object_types}")
|
||||
logger.info(
|
||||
f"All object types: num={len(all_object_types)} list={all_object_types}"
|
||||
)
|
||||
|
||||
# gc.collect()
|
||||
|
||||
# checkpoint - we've found all object types, now time to fetch the data
|
||||
logger.info("Starting to fetch CSVs for all object types")
|
||||
logger.info("Fetching CSVs for all object types")
|
||||
|
||||
# This takes like 30 minutes first time and <2 minutes for updates
|
||||
object_type_to_csv_path = fetch_all_csvs_in_parallel(
|
||||
sf_client=self.sf_client,
|
||||
sf_client=sf_client,
|
||||
object_types=all_object_types,
|
||||
start=start,
|
||||
end=end,
|
||||
target_dir=directory,
|
||||
)
|
||||
|
||||
# print useful information
|
||||
num_csvs = 0
|
||||
num_bytes = 0
|
||||
for object_type, csv_paths in object_type_to_csv_path.items():
|
||||
if not csv_paths:
|
||||
continue
|
||||
|
||||
for csv_path in csv_paths:
|
||||
if not csv_path:
|
||||
continue
|
||||
|
||||
file_path = Path(csv_path)
|
||||
file_size = file_path.stat().st_size
|
||||
num_csvs += 1
|
||||
num_bytes += file_size
|
||||
logger.info(
|
||||
f"CSV info: object_type={object_type} path={csv_path} bytes={file_size}"
|
||||
)
|
||||
|
||||
logger.info(f"CSV info total: total_csvs={num_csvs} total_bytes={num_bytes}")
|
||||
|
||||
@staticmethod
|
||||
def _load_csvs_to_db(csv_directory: str, db_directory: str) -> set[str]:
|
||||
updated_ids: set[str] = set()
|
||||
|
||||
object_type_to_csv_path = SalesforceConnector.reconstruct_object_types(
|
||||
csv_directory
|
||||
)
|
||||
|
||||
# This takes like 10 seconds
|
||||
# This is for testing the rest of the functionality if data has
|
||||
# already been fetched and put in sqlite
|
||||
@@ -120,10 +187,16 @@ class SalesforceConnector(LoadConnector, PollConnector, SlimConnector):
|
||||
# If path is None, it means it failed to fetch the csv
|
||||
if csv_paths is None:
|
||||
continue
|
||||
|
||||
# Go through each csv path and use it to update the db
|
||||
for csv_path in csv_paths:
|
||||
logger.debug(f"Updating {object_type} with {csv_path}")
|
||||
logger.debug(
|
||||
f"Processing CSV: object_type={object_type} "
|
||||
f"csv={csv_path} "
|
||||
f"len={Path(csv_path).stat().st_size}"
|
||||
)
|
||||
new_ids = update_sf_db_with_csv(
|
||||
db_directory,
|
||||
object_type=object_type,
|
||||
csv_download_path=csv_path,
|
||||
)
|
||||
@@ -132,49 +205,127 @@ class SalesforceConnector(LoadConnector, PollConnector, SlimConnector):
|
||||
f"Added {len(new_ids)} new/updated records for {object_type}"
|
||||
)
|
||||
|
||||
os.remove(csv_path)
|
||||
|
||||
return updated_ids
|
||||
|
||||
def _fetch_from_salesforce(
|
||||
self,
|
||||
temp_dir: str,
|
||||
start: SecondsSinceUnixEpoch | None = None,
|
||||
end: SecondsSinceUnixEpoch | None = None,
|
||||
) -> GenerateDocumentsOutput:
|
||||
logger.info("_fetch_from_salesforce starting.")
|
||||
if not self._sf_client:
|
||||
raise RuntimeError("self._sf_client is None!")
|
||||
|
||||
init_db(temp_dir)
|
||||
|
||||
sqlite_log_stats(temp_dir)
|
||||
|
||||
# Step 1 - download
|
||||
SalesforceConnector._download_object_csvs(
|
||||
temp_dir, self.parent_object_list, self._sf_client, start, end
|
||||
)
|
||||
gc.collect()
|
||||
|
||||
# Step 2 - load CSV's to sqlite
|
||||
updated_ids = SalesforceConnector._load_csvs_to_db(temp_dir, temp_dir)
|
||||
gc.collect()
|
||||
|
||||
logger.info(f"Found {len(updated_ids)} total updated records")
|
||||
logger.info(
|
||||
f"Starting to process parent objects of types: {self.parent_object_list}"
|
||||
)
|
||||
|
||||
docs_to_yield: list[Document] = []
|
||||
# Step 3 - extract and index docs
|
||||
batches_processed = 0
|
||||
docs_processed = 0
|
||||
docs_to_yield: list[Document] = []
|
||||
docs_to_yield_bytes = 0
|
||||
|
||||
# Takes 15-20 seconds per batch
|
||||
for parent_type, parent_id_batch in get_affected_parent_ids_by_type(
|
||||
temp_dir,
|
||||
updated_ids=list(updated_ids),
|
||||
parent_types=self.parent_object_list,
|
||||
):
|
||||
batches_processed += 1
|
||||
logger.info(
|
||||
f"Processing batch of {len(parent_id_batch)} {parent_type} objects"
|
||||
f"Processing batch: index={batches_processed} "
|
||||
f"object_type={parent_type} "
|
||||
f"len={len(parent_id_batch)} "
|
||||
f"processed={docs_processed} "
|
||||
f"remaining={len(updated_ids) - docs_processed}"
|
||||
)
|
||||
for parent_id in parent_id_batch:
|
||||
if not (parent_object := get_record(parent_id, parent_type)):
|
||||
if not (parent_object := get_record(temp_dir, parent_id, parent_type)):
|
||||
logger.warning(
|
||||
f"Failed to get parent object {parent_id} for {parent_type}"
|
||||
)
|
||||
continue
|
||||
|
||||
docs_to_yield.append(
|
||||
convert_sf_object_to_doc(
|
||||
sf_object=parent_object,
|
||||
sf_instance=self.sf_client.sf_instance,
|
||||
)
|
||||
doc = convert_sf_object_to_doc(
|
||||
temp_dir,
|
||||
sf_object=parent_object,
|
||||
sf_instance=self.sf_client.sf_instance,
|
||||
)
|
||||
doc_sizeof = sys.getsizeof(doc)
|
||||
docs_to_yield_bytes += doc_sizeof
|
||||
docs_to_yield.append(doc)
|
||||
docs_processed += 1
|
||||
|
||||
if len(docs_to_yield) >= self.batch_size:
|
||||
# memory usage is sensitive to the input length, so we're yielding immediately
|
||||
# if the batch exceeds a certain byte length
|
||||
if (
|
||||
len(docs_to_yield) >= self.batch_size
|
||||
or docs_to_yield_bytes > SalesforceConnector.MAX_BATCH_BYTES
|
||||
):
|
||||
yield docs_to_yield
|
||||
docs_to_yield = []
|
||||
docs_to_yield_bytes = 0
|
||||
|
||||
# observed a memory leak / size issue with the account table if we don't gc.collect here.
|
||||
gc.collect()
|
||||
|
||||
yield docs_to_yield
|
||||
logger.info(
|
||||
f"Final processing stats: "
|
||||
f"processed={docs_processed} "
|
||||
f"remaining={len(updated_ids) - docs_processed}"
|
||||
)
|
||||
|
||||
def load_from_state(self) -> GenerateDocumentsOutput:
return self._fetch_from_salesforce()
if MULTI_TENANT:
# if multi tenant, we cannot expect the sqlite db to be cached/present
with tempfile.TemporaryDirectory() as temp_dir:
return self._fetch_from_salesforce(temp_dir)

# nuke the db since we're starting from scratch
sqlite_db_path = get_sqlite_db_path(BASE_DATA_PATH)
if os.path.exists(sqlite_db_path):
logger.info(f"load_from_state: Removing db at {sqlite_db_path}.")
os.remove(sqlite_db_path)
return self._fetch_from_salesforce(BASE_DATA_PATH)

def poll_source(
self, start: SecondsSinceUnixEpoch, end: SecondsSinceUnixEpoch
) -> GenerateDocumentsOutput:
return self._fetch_from_salesforce(start=start, end=end)
if MULTI_TENANT:
# if multi tenant, we cannot expect the sqlite db to be cached/present
with tempfile.TemporaryDirectory() as temp_dir:
return self._fetch_from_salesforce(temp_dir, start=start, end=end)

if start == 0:
# nuke the db if we're starting from scratch
sqlite_db_path = get_sqlite_db_path(BASE_DATA_PATH)
if os.path.exists(sqlite_db_path):
logger.info(
f"poll_source: Starting at time 0, removing db at {sqlite_db_path}."
)
os.remove(sqlite_db_path)

return self._fetch_from_salesforce(BASE_DATA_PATH)
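`load_from_state` and `poll_source` now branch on `MULTI_TENANT`: multi-tenant runs work in a throwaway temp directory (no cached sqlite db can be assumed), while single-tenant runs reuse a persistent directory and delete the cached db when starting from scratch. A rough, combined sketch of that dispatch; `fetch` stands in for `_fetch_from_salesforce` and the path handling is simplified:

```python
import os
import tempfile


def run_fetch(fetch, multi_tenant: bool, base_data_path: str, start: float | None = None):
    """Rough sketch: pick a scratch dir per run when multi-tenant, otherwise
    reuse a persistent directory and drop the cached db on a full re-index."""
    if multi_tenant:
        # no cached sqlite db can be assumed; work in a throwaway directory
        with tempfile.TemporaryDirectory() as temp_dir:
            return fetch(temp_dir, start=start)

    db_path = os.path.join(base_data_path, "salesforce_db.sqlite")
    if (start is None or start == 0) and os.path.exists(db_path):
        os.remove(db_path)  # starting from scratch, so discard the cached db
    return fetch(base_data_path, start=start)
```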
|
||||
|
||||
def retrieve_all_slim_documents(
|
||||
self,
|
||||
@@ -209,7 +360,7 @@ if __name__ == "__main__":
|
||||
"sf_security_token": os.environ["SF_SECURITY_TOKEN"],
|
||||
}
|
||||
)
|
||||
start_time = time.time()
|
||||
start_time = time.monotonic()
|
||||
doc_count = 0
|
||||
section_count = 0
|
||||
text_count = 0
|
||||
@@ -221,7 +372,7 @@ if __name__ == "__main__":
|
||||
for section in doc.sections:
|
||||
if isinstance(section, TextSection) and section.text is not None:
|
||||
text_count += len(section.text)
|
||||
end_time = time.time()
|
||||
end_time = time.monotonic()
|
||||
|
||||
print(f"Doc count: {doc_count}")
|
||||
print(f"Section count: {section_count}")
|
||||
|
||||
@@ -124,13 +124,14 @@ def _extract_section(salesforce_object: SalesforceObject, base_url: str) -> Text
|
||||
|
||||
|
||||
def _extract_primary_owners(
|
||||
directory: str,
|
||||
sf_object: SalesforceObject,
|
||||
) -> list[BasicExpertInfo] | None:
|
||||
object_dict = sf_object.data
|
||||
if not (last_modified_by_id := object_dict.get("LastModifiedById")):
|
||||
logger.warning(f"No LastModifiedById found for {sf_object.id}")
|
||||
return None
|
||||
if not (last_modified_by := get_record(last_modified_by_id)):
|
||||
if not (last_modified_by := get_record(directory, last_modified_by_id)):
|
||||
logger.warning(f"No LastModifiedBy found for {last_modified_by_id}")
|
||||
return None
|
||||
|
||||
@@ -159,6 +160,7 @@ def _extract_primary_owners(
|
||||
|
||||
|
||||
def convert_sf_object_to_doc(
|
||||
directory: str,
|
||||
sf_object: SalesforceObject,
|
||||
sf_instance: str,
|
||||
) -> Document:
|
||||
@@ -170,8 +172,8 @@ def convert_sf_object_to_doc(
|
||||
extracted_semantic_identifier = object_dict.get("Name", "Unknown Object")
|
||||
|
||||
sections = [_extract_section(sf_object, base_url)]
|
||||
for id in get_child_ids(sf_object.id):
|
||||
if not (child_object := get_record(id)):
|
||||
for id in get_child_ids(directory, sf_object.id):
|
||||
if not (child_object := get_record(directory, id)):
|
||||
continue
|
||||
sections.append(_extract_section(child_object, base_url))
|
||||
|
||||
@@ -181,7 +183,7 @@ def convert_sf_object_to_doc(
|
||||
source=DocumentSource.SALESFORCE,
|
||||
semantic_identifier=extracted_semantic_identifier,
|
||||
doc_updated_at=extracted_doc_updated_at,
|
||||
primary_owners=_extract_primary_owners(sf_object),
|
||||
primary_owners=_extract_primary_owners(directory, sf_object),
|
||||
metadata={},
|
||||
)
|
||||
return doc
|
||||
|
||||
@@ -11,13 +11,12 @@ from simple_salesforce.bulk2 import SFBulk2Type
|
||||
|
||||
from onyx.connectors.interfaces import SecondsSinceUnixEpoch
|
||||
from onyx.connectors.salesforce.sqlite_functions import has_at_least_one_object_of_type
|
||||
from onyx.connectors.salesforce.utils import get_object_type_path
|
||||
from onyx.utils.logger import setup_logger
|
||||
|
||||
logger = setup_logger()
|
||||
|
||||
|
||||
def _build_time_filter_for_salesforce(
def _build_last_modified_time_filter_for_salesforce(
start: SecondsSinceUnixEpoch | None, end: SecondsSinceUnixEpoch | None
) -> str:
if start is None or end is None:
@@ -30,6 +29,19 @@ def _build_time_filter_for_salesforce(
)


def _build_created_date_time_filter_for_salesforce(
start: SecondsSinceUnixEpoch | None, end: SecondsSinceUnixEpoch | None
) -> str:
if start is None or end is None:
return ""
start_datetime = datetime.fromtimestamp(start, UTC)
end_datetime = datetime.fromtimestamp(end, UTC)
return (
f" WHERE CreatedDate > {start_datetime.isoformat()} "
f"AND CreatedDate < {end_datetime.isoformat()}"
)
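The two filter builders differ only in which audit column they bound: `LastModifiedDate` for most objects and `CreatedDate` for types that lack it. A small parameterized variant, assuming UTC epoch-second inputs like the diff uses:

```python
from datetime import datetime, timezone


def build_soql_time_filter(
    start: float | None, end: float | None, column: str = "LastModifiedDate"
) -> str:
    """Return a ' WHERE ...' clause bounding `column`, or '' when unbounded."""
    if start is None or end is None:
        return ""
    start_dt = datetime.fromtimestamp(start, timezone.utc)
    end_dt = datetime.fromtimestamp(end, timezone.utc)
    return f" WHERE {column} > {start_dt.isoformat()} AND {column} < {end_dt.isoformat()}"


# e.g. build_soql_time_filter(0, 86400, column="CreatedDate")
```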
|
||||
|
||||
|
||||
def _get_sf_type_object_json(sf_client: Salesforce, type_name: str) -> Any:
|
||||
sf_object = SFType(type_name, sf_client.session_id, sf_client.sf_instance)
|
||||
return sf_object.describe()
|
||||
@@ -109,23 +121,6 @@ def _check_if_object_type_is_empty(
|
||||
return True
|
||||
|
||||
|
||||
def _check_for_existing_csvs(sf_type: str) -> list[str] | None:
|
||||
# Check if the csv already exists
|
||||
if os.path.exists(get_object_type_path(sf_type)):
|
||||
existing_csvs = [
|
||||
os.path.join(get_object_type_path(sf_type), f)
|
||||
for f in os.listdir(get_object_type_path(sf_type))
|
||||
if f.endswith(".csv")
|
||||
]
|
||||
# If the csv already exists, return the path
|
||||
# This is likely due to a previous run that failed
|
||||
# after downloading the csv but before the data was
|
||||
# written to the db
|
||||
if existing_csvs:
|
||||
return existing_csvs
|
||||
return None
|
||||
|
||||
|
||||
def _build_bulk_query(sf_client: Salesforce, sf_type: str, time_filter: str) -> str:
|
||||
queryable_fields = _get_all_queryable_fields_of_sf_type(sf_client, sf_type)
|
||||
query = f"SELECT {', '.join(queryable_fields)} FROM {sf_type}{time_filter}"
|
||||
@@ -133,16 +128,15 @@ def _build_bulk_query(sf_client: Salesforce, sf_type: str, time_filter: str) ->
|
||||
|
||||
|
||||
def _bulk_retrieve_from_salesforce(
|
||||
sf_client: Salesforce,
|
||||
sf_type: str,
|
||||
time_filter: str,
|
||||
sf_client: Salesforce, sf_type: str, time_filter: str, target_dir: str
|
||||
) -> tuple[str, list[str] | None]:
|
||||
"""Returns a tuple of
|
||||
1. the salesforce object type
|
||||
2. the list of CSV's
|
||||
"""
|
||||
if not _check_if_object_type_is_empty(sf_client, sf_type, time_filter):
|
||||
return sf_type, None
|
||||
|
||||
if existing_csvs := _check_for_existing_csvs(sf_type):
|
||||
return sf_type, existing_csvs
|
||||
|
||||
query = _build_bulk_query(sf_client, sf_type, time_filter)
|
||||
|
||||
bulk_2_handler = SFBulk2Handler(
|
||||
@@ -159,20 +153,33 @@ def _bulk_retrieve_from_salesforce(
|
||||
)
|
||||
|
||||
logger.info(f"Downloading {sf_type}")
|
||||
logger.info(f"Query: {query}")
|
||||
|
||||
logger.debug(f"Query: {query}")
|
||||
|
||||
try:
|
||||
# This downloads the file to a file in the target path with a random name
|
||||
results = bulk_2_type.download(
|
||||
query=query,
|
||||
path=get_object_type_path(sf_type),
|
||||
path=target_dir,
|
||||
max_records=1000000,
|
||||
)
|
||||
all_download_paths = [result["file"] for result in results]
|
||||
|
||||
# prepend each downloaded csv with the object type (delimiter = '.')
|
||||
all_download_paths: list[str] = []
|
||||
for result in results:
|
||||
original_file_path = result["file"]
|
||||
directory, filename = os.path.split(original_file_path)
|
||||
new_filename = f"{sf_type}.{filename}"
|
||||
new_file_path = os.path.join(directory, new_filename)
|
||||
os.rename(original_file_path, new_file_path)
|
||||
all_download_paths.append(new_file_path)
|
||||
logger.info(f"Downloaded {sf_type} to {all_download_paths}")
|
||||
return sf_type, all_download_paths
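Because `SFBulk2Type.download` writes each result to a randomly named file in `target_dir`, the hunk above renames every CSV to carry an object-type prefix with a `.` delimiter. A sketch of that convention, including a reverse lookup that is an assumption here (the loader may recover the type differently):

```python
import os


def prefix_with_type(file_path: str, sf_type: str) -> str:
    """Rename e.g. /tmp/abc123.csv -> /tmp/Account.abc123.csv and return the new path."""
    directory, filename = os.path.split(file_path)
    new_path = os.path.join(directory, f"{sf_type}.{filename}")
    os.rename(file_path, new_path)
    return new_path


def type_from_csv_name(file_path: str) -> str:
    """Recover the object type from a prefixed filename (assumed convention)."""
    return os.path.basename(file_path).split(".", 1)[0]
```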
|
||||
except Exception as e:
|
||||
logger.info(f"Failed to download salesforce csv for object type {sf_type}: {e}")
|
||||
logger.error(
|
||||
f"Failed to download salesforce csv for object type {sf_type}: {e}"
|
||||
)
|
||||
logger.warning(f"Exceptioning query for object type {sf_type}: {query}")
|
||||
return sf_type, None
|
||||
|
||||
|
||||
@@ -181,12 +188,35 @@ def fetch_all_csvs_in_parallel(
|
||||
object_types: set[str],
|
||||
start: SecondsSinceUnixEpoch | None,
|
||||
end: SecondsSinceUnixEpoch | None,
|
||||
target_dir: str,
|
||||
) -> dict[str, list[str] | None]:
|
||||
"""
|
||||
Fetches all the csvs in parallel for the given object types
|
||||
Returns a dict of (sf_type, full_download_path)
|
||||
"""
|
||||
time_filter = _build_time_filter_for_salesforce(start, end)
|
||||
|
||||
# these types don't query properly and need looking at
|
||||
# problem_types: set[str] = {
|
||||
# "ContentDocumentLink",
|
||||
# "RecordActionHistory",
|
||||
# "PendingOrderSummary",
|
||||
# "UnifiedActivityRelation",
|
||||
# }
|
||||
|
||||
# these types don't have a LastModifiedDate field and instead use CreatedDate
|
||||
created_date_types: set[str] = {
|
||||
"AccountHistory",
|
||||
"AccountTag",
|
||||
"EntitySubscription",
|
||||
}
|
||||
|
||||
last_modified_time_filter = _build_last_modified_time_filter_for_salesforce(
|
||||
start, end
|
||||
)
|
||||
created_date_time_filter = _build_created_date_time_filter_for_salesforce(
|
||||
start, end
|
||||
)
|
||||
|
||||
time_filter_for_each_object_type = {}
|
||||
# We do this outside of the thread pool executor because this requires
|
||||
# a database connection and we don't want to block the thread pool
|
||||
@@ -195,8 +225,11 @@ def fetch_all_csvs_in_parallel(
|
||||
"""Only add time filter if there is at least one object of the type
|
||||
in the database. We aren't worried about partially completed object update runs
|
||||
because this occurs after we check for existing csvs which covers this case"""
|
||||
if has_at_least_one_object_of_type(sf_type):
|
||||
time_filter_for_each_object_type[sf_type] = time_filter
|
||||
if has_at_least_one_object_of_type(target_dir, sf_type):
|
||||
if sf_type in created_date_types:
|
||||
time_filter_for_each_object_type[sf_type] = created_date_time_filter
|
||||
else:
|
||||
time_filter_for_each_object_type[sf_type] = last_modified_time_filter
|
||||
else:
|
||||
time_filter_for_each_object_type[sf_type] = ""
|
||||
|
||||
@@ -207,6 +240,7 @@ def fetch_all_csvs_in_parallel(
|
||||
sf_client=sf_client,
|
||||
sf_type=object_type,
|
||||
time_filter=time_filter_for_each_object_type[object_type],
|
||||
target_dir=target_dir,
|
||||
),
|
||||
object_types,
|
||||
)
|
||||
|
||||
@@ -2,8 +2,10 @@ import csv
|
||||
import json
|
||||
import os
|
||||
import sqlite3
|
||||
import time
|
||||
from collections.abc import Iterator
|
||||
from contextlib import contextmanager
|
||||
from pathlib import Path
|
||||
|
||||
from onyx.connectors.salesforce.utils import get_sqlite_db_path
|
||||
from onyx.connectors.salesforce.utils import SalesforceObject
|
||||
@@ -16,6 +18,7 @@ logger = setup_logger()
|
||||
|
||||
@contextmanager
|
||||
def get_db_connection(
|
||||
directory: str,
|
||||
isolation_level: str | None = None,
|
||||
) -> Iterator[sqlite3.Connection]:
|
||||
"""Get a database connection with proper isolation level and error handling.
|
||||
@@ -25,7 +28,7 @@ def get_db_connection(
|
||||
can be "IMMEDIATE" or "EXCLUSIVE" for more strict isolation.
|
||||
"""
|
||||
# 60 second timeout for locks
|
||||
conn = sqlite3.connect(get_sqlite_db_path(), timeout=60.0)
|
||||
conn = sqlite3.connect(get_sqlite_db_path(directory), timeout=60.0)
|
||||
|
||||
if isolation_level is not None:
|
||||
conn.isolation_level = isolation_level
|
||||
@@ -38,17 +41,41 @@ def get_db_connection(
|
||||
conn.close()
|
||||
|
||||
|
||||
def init_db() -> None:
|
||||
def sqlite_log_stats(directory: str) -> None:
with get_db_connection(directory, "EXCLUSIVE") as conn:
cache_pages = conn.execute("PRAGMA cache_size").fetchone()[0]
page_size = conn.execute("PRAGMA page_size").fetchone()[0]
if cache_pages >= 0:
cache_bytes = cache_pages * page_size
else:
cache_bytes = abs(cache_pages * 1024)
logger.info(
f"SQLite stats: sqlite_version={sqlite3.sqlite_version} "
f"cache_pages={cache_pages} "
f"page_size={page_size} "
f"cache_bytes={cache_bytes}"
)
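`PRAGMA cache_size` is reported either as a positive page count or as a negative number of KiB, which is why the stats helper branches before computing bytes. A standalone illustration:

```python
import sqlite3


def sqlite_cache_bytes(conn: sqlite3.Connection) -> int:
    """Translate PRAGMA cache_size into bytes: a positive value is a page
    count, a negative value means "N KiB" (per the SQLite docs)."""
    cache_pages = conn.execute("PRAGMA cache_size").fetchone()[0]
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    if cache_pages >= 0:
        return cache_pages * page_size
    return abs(cache_pages) * 1024


conn = sqlite3.connect(":memory:")
print(sqlite_cache_bytes(conn))  # default cache_size is -2000, i.e. roughly 2 MiB
conn.close()
```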
|
||||
|
||||
|
||||
def init_db(directory: str) -> None:
|
||||
"""Initialize the SQLite database with required tables if they don't exist."""
|
||||
# Create database directory if it doesn't exist
|
||||
os.makedirs(os.path.dirname(get_sqlite_db_path()), exist_ok=True)
|
||||
start = time.monotonic()
|
||||
|
||||
with get_db_connection("EXCLUSIVE") as conn:
|
||||
os.makedirs(os.path.dirname(get_sqlite_db_path(directory)), exist_ok=True)
|
||||
|
||||
with get_db_connection(directory, "EXCLUSIVE") as conn:
|
||||
cursor = conn.cursor()
|
||||
|
||||
db_exists = os.path.exists(get_sqlite_db_path())
|
||||
db_exists = os.path.exists(get_sqlite_db_path(directory))
|
||||
|
||||
if db_exists:
|
||||
file_path = Path(get_sqlite_db_path(directory))
|
||||
file_size = file_path.stat().st_size
|
||||
logger.info(f"init_db - found existing sqlite db: len={file_size}")
|
||||
else:
|
||||
# why is this only if the db doesn't exist?
|
||||
|
||||
if not db_exists:
|
||||
# Enable WAL mode for better concurrent access and write performance
|
||||
cursor.execute("PRAGMA journal_mode=WAL")
|
||||
cursor.execute("PRAGMA synchronous=NORMAL")
|
||||
@@ -143,16 +170,31 @@ def init_db() -> None:
|
||||
""",
|
||||
)
|
||||
|
||||
elapsed = time.monotonic() - start
|
||||
logger.info(f"init_db - create tables and indices: elapsed={elapsed:.2f}")
|
||||
|
||||
# Analyze tables to help query planner
|
||||
cursor.execute("ANALYZE relationships")
|
||||
cursor.execute("ANALYZE salesforce_objects")
|
||||
cursor.execute("ANALYZE relationship_types")
|
||||
cursor.execute("ANALYZE user_email_map")
|
||||
# NOTE(rkuo): skip ANALYZE - it takes too long and we likely don't have
|
||||
# complicated queries that need this
|
||||
# start = time.monotonic()
|
||||
# cursor.execute("ANALYZE relationships")
|
||||
# cursor.execute("ANALYZE salesforce_objects")
|
||||
# cursor.execute("ANALYZE relationship_types")
|
||||
# cursor.execute("ANALYZE user_email_map")
|
||||
# elapsed = time.monotonic() - start
|
||||
# logger.info(f"init_db - analyze: elapsed={elapsed:.2f}")
|
||||
|
||||
# If database already existed but user_email_map needs to be populated
|
||||
start = time.monotonic()
|
||||
cursor.execute("SELECT COUNT(*) FROM user_email_map")
|
||||
elapsed = time.monotonic() - start
|
||||
logger.info(f"init_db - count user_email_map: elapsed={elapsed:.2f}")
|
||||
|
||||
start = time.monotonic()
|
||||
if cursor.fetchone()[0] == 0:
|
||||
_update_user_email_map(conn)
|
||||
elapsed = time.monotonic() - start
|
||||
logger.info(f"init_db - update_user_email_map: elapsed={elapsed:.2f}")
|
||||
|
||||
conn.commit()
|
||||
|
||||
@@ -240,15 +282,15 @@ def _update_user_email_map(conn: sqlite3.Connection) -> None:
|
||||
|
||||
|
||||
def update_sf_db_with_csv(
|
||||
directory: str,
|
||||
object_type: str,
|
||||
csv_download_path: str,
|
||||
delete_csv_after_use: bool = True,
|
||||
) -> list[str]:
|
||||
"""Update the SF DB with a CSV file using SQLite storage."""
|
||||
updated_ids = []
|
||||
|
||||
# Use IMMEDIATE to get a write lock at the start of the transaction
|
||||
with get_db_connection("IMMEDIATE") as conn:
|
||||
with get_db_connection(directory, "IMMEDIATE") as conn:
|
||||
cursor = conn.cursor()
|
||||
|
||||
with open(csv_download_path, "r", newline="", encoding="utf-8") as f:
|
||||
@@ -295,17 +337,12 @@ def update_sf_db_with_csv(
|
||||
|
||||
conn.commit()
|
||||
|
||||
if delete_csv_after_use:
|
||||
# Remove the csv file after it has been used
|
||||
# to successfully update the db
|
||||
os.remove(csv_download_path)
|
||||
|
||||
return updated_ids
|
||||
|
||||
|
||||
def get_child_ids(parent_id: str) -> set[str]:
|
||||
def get_child_ids(directory: str, parent_id: str) -> set[str]:
|
||||
"""Get all child IDs for a given parent ID."""
|
||||
with get_db_connection() as conn:
|
||||
with get_db_connection(directory) as conn:
|
||||
cursor = conn.cursor()
|
||||
|
||||
# Force index usage with INDEXED BY
|
||||
@@ -317,9 +354,9 @@ def get_child_ids(parent_id: str) -> set[str]:
|
||||
return child_ids
|
||||
|
||||
|
||||
def get_type_from_id(object_id: str) -> str | None:
|
||||
def get_type_from_id(directory: str, object_id: str) -> str | None:
|
||||
"""Get the type of an object from its ID."""
|
||||
with get_db_connection() as conn:
|
||||
with get_db_connection(directory) as conn:
|
||||
cursor = conn.cursor()
|
||||
cursor.execute(
|
||||
"SELECT object_type FROM salesforce_objects WHERE id = ?", (object_id,)
|
||||
@@ -332,15 +369,15 @@ def get_type_from_id(object_id: str) -> str | None:
|
||||
|
||||
|
||||
def get_record(
|
||||
object_id: str, object_type: str | None = None
|
||||
directory: str, object_id: str, object_type: str | None = None
|
||||
) -> SalesforceObject | None:
|
||||
"""Retrieve the record and return it as a SalesforceObject."""
|
||||
if object_type is None:
|
||||
object_type = get_type_from_id(object_id)
|
||||
object_type = get_type_from_id(directory, object_id)
|
||||
if not object_type:
|
||||
return None
|
||||
|
||||
with get_db_connection() as conn:
|
||||
with get_db_connection(directory) as conn:
|
||||
cursor = conn.cursor()
|
||||
cursor.execute("SELECT data FROM salesforce_objects WHERE id = ?", (object_id,))
|
||||
result = cursor.fetchone()
|
||||
@@ -352,9 +389,9 @@ def get_record(
|
||||
return SalesforceObject(id=object_id, type=object_type, data=data)
|
||||
|
||||
|
||||
def find_ids_by_type(object_type: str) -> list[str]:
|
||||
def find_ids_by_type(directory: str, object_type: str) -> list[str]:
|
||||
"""Find all object IDs for rows of the specified type."""
|
||||
with get_db_connection() as conn:
|
||||
with get_db_connection(directory) as conn:
|
||||
cursor = conn.cursor()
|
||||
cursor.execute(
|
||||
"SELECT id FROM salesforce_objects WHERE object_type = ?", (object_type,)
|
||||
@@ -363,6 +400,7 @@ def find_ids_by_type(object_type: str) -> list[str]:
|
||||
|
||||
|
||||
def get_affected_parent_ids_by_type(
|
||||
directory: str,
|
||||
updated_ids: list[str],
|
||||
parent_types: list[str],
|
||||
batch_size: int = 500,
|
||||
@@ -374,7 +412,7 @@ def get_affected_parent_ids_by_type(
|
||||
updated_ids_batches = batch_list(updated_ids, batch_size)
|
||||
updated_parent_ids: set[str] = set()
|
||||
|
||||
with get_db_connection() as conn:
|
||||
with get_db_connection(directory) as conn:
|
||||
cursor = conn.cursor()
|
||||
|
||||
for batch_ids in updated_ids_batches:
|
||||
@@ -419,7 +457,7 @@ def get_affected_parent_ids_by_type(
|
||||
yield parent_type, new_affected_ids
|
||||
|
||||
|
||||
def has_at_least_one_object_of_type(object_type: str) -> bool:
|
||||
def has_at_least_one_object_of_type(directory: str, object_type: str) -> bool:
|
||||
"""Check if there is at least one object of the specified type in the database.
|
||||
|
||||
Args:
|
||||
@@ -428,7 +466,7 @@ def has_at_least_one_object_of_type(object_type: str) -> bool:
|
||||
Returns:
|
||||
bool: True if at least one object exists, False otherwise
|
||||
"""
|
||||
with get_db_connection() as conn:
|
||||
with get_db_connection(directory) as conn:
|
||||
cursor = conn.cursor()
|
||||
cursor.execute(
|
||||
"SELECT COUNT(*) FROM salesforce_objects WHERE object_type = ?",
|
||||
@@ -443,7 +481,7 @@ def has_at_least_one_object_of_type(object_type: str) -> bool:
|
||||
NULL_ID_STRING = "N/A"
|
||||
|
||||
|
||||
def get_user_id_by_email(email: str) -> str | None:
|
||||
def get_user_id_by_email(directory: str, email: str) -> str | None:
|
||||
"""Get the Salesforce User ID for a given email address.
|
||||
|
||||
Args:
|
||||
@@ -454,7 +492,7 @@ def get_user_id_by_email(email: str) -> str | None:
|
||||
- was_found: True if the email exists in the table, False if not found
|
||||
- user_id: The Salesforce User ID if exists, None otherwise
|
||||
"""
|
||||
with get_db_connection() as conn:
|
||||
with get_db_connection(directory) as conn:
|
||||
cursor = conn.cursor()
|
||||
cursor.execute("SELECT user_id FROM user_email_map WHERE email = ?", (email,))
|
||||
result = cursor.fetchone()
|
||||
@@ -463,10 +501,10 @@ def get_user_id_by_email(email: str) -> str | None:
|
||||
return result[0]
|
||||
|
||||
|
||||
def update_email_to_id_table(email: str, id: str | None) -> None:
|
||||
def update_email_to_id_table(directory: str, email: str, id: str | None) -> None:
|
||||
"""Update the email to ID map table with a new email and ID."""
|
||||
id_to_use = id or NULL_ID_STRING
|
||||
with get_db_connection() as conn:
|
||||
with get_db_connection(directory) as conn:
|
||||
cursor = conn.cursor()
|
||||
cursor.execute(
|
||||
"INSERT OR REPLACE INTO user_email_map (email, user_id) VALUES (?, ?)",
|
||||
|
||||
@@ -25,16 +25,14 @@ class SalesforceObject:
|
||||
)
|
||||
|
||||
|
||||
# te
|
||||
|
||||
# This defines the base path for all data files relative to this file
|
||||
# AKA BE CAREFUL WHEN MOVING THIS FILE
|
||||
BASE_DATA_PATH = os.path.join(os.path.dirname(__file__), "data")
|
||||
|
||||
|
||||
def get_sqlite_db_path() -> str:
|
||||
def get_sqlite_db_path(directory: str) -> str:
|
||||
"""Get the path to the sqlite db file."""
|
||||
return os.path.join(BASE_DATA_PATH, "salesforce_db.sqlite")
|
||||
return os.path.join(directory, "salesforce_db.sqlite")
|
||||
|
||||
|
||||
def get_object_type_path(object_type: str) -> str:
|
||||
|
||||
@@ -255,7 +255,9 @@ _DISALLOWED_MSG_SUBTYPES = {
|
||||
def default_msg_filter(message: MessageType) -> bool:
|
||||
# Don't keep messages from bots
|
||||
if message.get("bot_id") or message.get("app_id"):
|
||||
if message.get("bot_profile", {}).get("name") == "OnyxConnector":
|
||||
bot_profile_name = message.get("bot_profile", {}).get("name")
|
||||
print(f"bot_profile_name: {bot_profile_name}")
|
||||
if bot_profile_name == "DanswerBot Testing":
|
||||
return False
|
||||
return True
|
||||
|
||||
|
||||
@@ -5,11 +5,13 @@ from typing import cast
|
||||
|
||||
from sqlalchemy.orm import Session
|
||||
|
||||
from onyx.chat.models import ContextualPruningConfig
|
||||
from onyx.chat.models import PromptConfig
|
||||
from onyx.chat.models import SectionRelevancePiece
|
||||
from onyx.chat.prune_and_merge import _merge_sections
|
||||
from onyx.chat.prune_and_merge import ChunkRange
|
||||
from onyx.chat.prune_and_merge import merge_chunk_intervals
|
||||
from onyx.chat.prune_and_merge import prune_and_merge_sections
|
||||
from onyx.configs.chat_configs import DISABLE_LLM_DOC_RELEVANCE
|
||||
from onyx.context.search.enums import LLMEvaluationType
|
||||
from onyx.context.search.enums import QueryFlow
|
||||
@@ -61,6 +63,7 @@ class SearchPipeline:
|
||||
| None = None,
|
||||
rerank_metrics_callback: Callable[[RerankMetricsContainer], None] | None = None,
|
||||
prompt_config: PromptConfig | None = None,
|
||||
contextual_pruning_config: ContextualPruningConfig | None = None,
|
||||
):
|
||||
# NOTE: The Search Request contains a lot of fields that are overrides, many of them can be None
|
||||
# and typically are None. The preprocessing will fetch default values to replace these empty overrides.
|
||||
@@ -77,6 +80,9 @@ class SearchPipeline:
|
||||
self.search_settings = get_current_search_settings(db_session)
|
||||
self.document_index = get_default_document_index(self.search_settings, None)
|
||||
self.prompt_config: PromptConfig | None = prompt_config
|
||||
self.contextual_pruning_config: ContextualPruningConfig | None = (
|
||||
contextual_pruning_config
|
||||
)
|
||||
|
||||
# Preprocessing steps generate this
|
||||
self._search_query: SearchQuery | None = None
|
||||
@@ -221,7 +227,7 @@ class SearchPipeline:
|
||||
|
||||
# If ee is enabled, censor the chunk sections based on user access
|
||||
# Otherwise, return the retrieved chunks
|
||||
censored_chunks = fetch_ee_implementation_or_noop(
|
||||
censored_chunks: list[InferenceChunk] = fetch_ee_implementation_or_noop(
|
||||
"onyx.external_permissions.post_query_censoring",
|
||||
"_post_query_chunk_censoring",
|
||||
retrieved_chunks,
|
||||
@@ -420,7 +426,26 @@ class SearchPipeline:
|
||||
if self._final_context_sections is not None:
|
||||
return self._final_context_sections
|
||||
|
||||
self._final_context_sections = _merge_sections(sections=self.reranked_sections)
|
||||
if (
|
||||
self.contextual_pruning_config is not None
|
||||
and self.prompt_config is not None
|
||||
):
|
||||
self._final_context_sections = prune_and_merge_sections(
|
||||
sections=self.reranked_sections,
|
||||
section_relevance_list=None,
|
||||
prompt_config=self.prompt_config,
|
||||
llm_config=self.llm.config,
|
||||
question=self.search_query.query,
|
||||
contextual_pruning_config=self.contextual_pruning_config,
|
||||
)
|
||||
|
||||
else:
|
||||
logger.error(
|
||||
"Contextual pruning or prompt config not set, using default merge"
|
||||
)
|
||||
self._final_context_sections = _merge_sections(
|
||||
sections=self.reranked_sections
|
||||
)
|
||||
return self._final_context_sections
|
||||
|
||||
@property
|
||||
|
||||
@@ -613,8 +613,19 @@ def fetch_connector_credential_pairs(
|
||||
|
||||
def resync_cc_pair(
|
||||
cc_pair: ConnectorCredentialPair,
|
||||
search_settings_id: int,
|
||||
db_session: Session,
|
||||
) -> None:
|
||||
"""
|
||||
Updates state stored in the connector_credential_pair table based on the
|
||||
latest index attempt for the given search settings.
|
||||
|
||||
Args:
|
||||
cc_pair: ConnectorCredentialPair to resync
|
||||
search_settings_id: SearchSettings to use for resync
|
||||
db_session: Database session
|
||||
"""
|
||||
|
||||
def find_latest_index_attempt(
|
||||
connector_id: int,
|
||||
credential_id: int,
|
||||
@@ -627,11 +638,10 @@ def resync_cc_pair(
|
||||
ConnectorCredentialPair,
|
||||
IndexAttempt.connector_credential_pair_id == ConnectorCredentialPair.id,
|
||||
)
|
||||
.join(SearchSettings, IndexAttempt.search_settings_id == SearchSettings.id)
|
||||
.filter(
|
||||
ConnectorCredentialPair.connector_id == connector_id,
|
||||
ConnectorCredentialPair.credential_id == credential_id,
|
||||
SearchSettings.status == IndexModelStatus.PRESENT,
|
||||
IndexAttempt.search_settings_id == search_settings_id,
|
||||
)
|
||||
)
|
||||
|
||||
|
||||
@@ -43,6 +43,8 @@ from onyx.utils.logger import setup_logger
|
||||
|
||||
logger = setup_logger()
|
||||
|
||||
ONE_HOUR_IN_SECONDS = 60 * 60
|
||||
|
||||
|
||||
def check_docs_exist(db_session: Session) -> bool:
|
||||
stmt = select(exists(DbDocument))
|
||||
@@ -607,6 +609,46 @@ def delete_documents_complete__no_commit(
|
||||
delete_documents__no_commit(db_session, document_ids)
|
||||
|
||||
|
||||
def delete_all_documents_for_connector_credential_pair(
|
||||
db_session: Session,
|
||||
connector_id: int,
|
||||
credential_id: int,
|
||||
timeout: int = ONE_HOUR_IN_SECONDS,
|
||||
) -> None:
|
||||
"""Delete all documents for a given connector credential pair.
|
||||
This will delete all documents and their associated data (chunks, feedback, tags, etc.)
|
||||
|
||||
NOTE: a bit inefficient, but it's not a big deal since this is done rarely - only during
|
||||
an index swap. If we wanted to make this more efficient, we could use a single delete
|
||||
statement + cascade.
|
||||
"""
|
||||
batch_size = 1000
|
||||
start_time = time.monotonic()
|
||||
|
||||
while True:
|
||||
# Get document IDs in batches
|
||||
stmt = (
|
||||
select(DocumentByConnectorCredentialPair.id)
|
||||
.where(
|
||||
DocumentByConnectorCredentialPair.connector_id == connector_id,
|
||||
DocumentByConnectorCredentialPair.credential_id == credential_id,
|
||||
)
|
||||
.limit(batch_size)
|
||||
)
|
||||
document_ids = db_session.scalars(stmt).all()
|
||||
|
||||
if not document_ids:
|
||||
break
|
||||
|
||||
delete_documents_complete__no_commit(
|
||||
db_session=db_session, document_ids=list(document_ids)
|
||||
)
|
||||
db_session.commit()
|
||||
|
||||
if time.monotonic() - start_time > timeout:
|
||||
raise RuntimeError("Timeout reached while deleting documents")
|
||||
|
||||
|
||||
def acquire_document_locks(db_session: Session, document_ids: list[str]) -> bool:
|
||||
"""Acquire locks for the specified documents. Ideally this shouldn't be
|
||||
called with large list of document_ids (an exception could be made if the
|
||||
|
||||
@@ -710,6 +710,25 @@ def cancel_indexing_attempts_past_model(
|
||||
)
|
||||
|
||||
|
||||
def cancel_indexing_attempts_for_search_settings(
|
||||
search_settings_id: int,
|
||||
db_session: Session,
|
||||
) -> None:
|
||||
"""Stops all indexing attempts that are in progress or not started for
|
||||
the specified search settings."""
|
||||
|
||||
db_session.execute(
|
||||
update(IndexAttempt)
|
||||
.where(
|
||||
IndexAttempt.status.in_(
|
||||
[IndexingStatus.IN_PROGRESS, IndexingStatus.NOT_STARTED]
|
||||
),
|
||||
IndexAttempt.search_settings_id == search_settings_id,
|
||||
)
|
||||
.values(status=IndexingStatus.FAILED)
|
||||
)
|
||||
|
||||
|
||||
def count_unique_cc_pairs_with_successful_index_attempts(
|
||||
search_settings_id: int | None,
|
||||
db_session: Session,
|
||||
|
||||
@@ -703,7 +703,11 @@ class Connector(Base):
|
||||
)
|
||||
documents_by_connector: Mapped[
|
||||
list["DocumentByConnectorCredentialPair"]
|
||||
] = relationship("DocumentByConnectorCredentialPair", back_populates="connector")
|
||||
] = relationship(
|
||||
"DocumentByConnectorCredentialPair",
|
||||
back_populates="connector",
|
||||
passive_deletes=True,
|
||||
)
|
||||
|
||||
# synchronize this validation logic with RefreshFrequencySchema etc on front end
|
||||
# until we have a centralized validation schema
|
||||
@@ -757,7 +761,11 @@ class Credential(Base):
|
||||
)
|
||||
documents_by_credential: Mapped[
|
||||
list["DocumentByConnectorCredentialPair"]
|
||||
] = relationship("DocumentByConnectorCredentialPair", back_populates="credential")
|
||||
] = relationship(
|
||||
"DocumentByConnectorCredentialPair",
|
||||
back_populates="credential",
|
||||
passive_deletes=True,
|
||||
)
|
||||
|
||||
user: Mapped[User | None] = relationship("User", back_populates="credentials")
|
||||
|
||||
@@ -1110,10 +1118,10 @@ class DocumentByConnectorCredentialPair(Base):
|
||||
id: Mapped[str] = mapped_column(ForeignKey("document.id"), primary_key=True)
|
||||
# TODO: transition this to use the ConnectorCredentialPair id directly
|
||||
connector_id: Mapped[int] = mapped_column(
|
||||
ForeignKey("connector.id"), primary_key=True
|
||||
ForeignKey("connector.id", ondelete="CASCADE"), primary_key=True
|
||||
)
|
||||
credential_id: Mapped[int] = mapped_column(
|
||||
ForeignKey("credential.id"), primary_key=True
|
||||
ForeignKey("credential.id", ondelete="CASCADE"), primary_key=True
|
||||
)
|
||||
|
||||
# used to better keep track of document counts at a connector level
|
||||
@@ -1123,10 +1131,10 @@ class DocumentByConnectorCredentialPair(Base):
|
||||
has_been_indexed: Mapped[bool] = mapped_column(Boolean)
|
||||
|
||||
connector: Mapped[Connector] = relationship(
|
||||
"Connector", back_populates="documents_by_connector"
|
||||
"Connector", back_populates="documents_by_connector", passive_deletes=True
|
||||
)
|
||||
credential: Mapped[Credential] = relationship(
|
||||
"Credential", back_populates="documents_by_credential"
|
||||
"Credential", back_populates="documents_by_credential", passive_deletes=True
|
||||
)
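These model changes pair `ondelete="CASCADE"` on the foreign-key columns with `passive_deletes=True` on the relationships, so dependent `DocumentByConnectorCredentialPair` rows are removed by the database rather than loaded and deleted by the ORM. A toy SQLAlchemy 2.0 sketch of the same combination (not the Onyx models; note SQLite only honors cascades with `PRAGMA foreign_keys=ON`):

```python
from sqlalchemy import ForeignKey, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship


class Base(DeclarativeBase):
    pass


class Connector(Base):
    __tablename__ = "connector"
    id: Mapped[int] = mapped_column(primary_key=True)
    # passive_deletes=True: don't load children just to delete them in Python,
    # let the database's ON DELETE CASCADE remove them instead
    documents: Mapped[list["DocByConnector"]] = relationship(
        back_populates="connector", passive_deletes=True
    )


class DocByConnector(Base):
    __tablename__ = "doc_by_connector"
    id: Mapped[str] = mapped_column(primary_key=True)
    connector_id: Mapped[int] = mapped_column(
        ForeignKey("connector.id", ondelete="CASCADE")
    )
    connector: Mapped["Connector"] = relationship(back_populates="documents")


engine = create_engine("sqlite://")  # in-memory toy engine
Base.metadata.create_all(engine)
```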
|
||||
|
||||
__table_args__ = (
|
||||
@@ -1650,8 +1658,8 @@ class Prompt(Base):
|
||||
)
|
||||
name: Mapped[str] = mapped_column(String)
|
||||
description: Mapped[str] = mapped_column(String)
|
||||
system_prompt: Mapped[str] = mapped_column(Text)
|
||||
task_prompt: Mapped[str] = mapped_column(Text)
|
||||
system_prompt: Mapped[str] = mapped_column(String(length=8000))
|
||||
task_prompt: Mapped[str] = mapped_column(String(length=8000))
|
||||
include_citations: Mapped[bool] = mapped_column(Boolean, default=True)
|
||||
datetime_aware: Mapped[bool] = mapped_column(Boolean, default=True)
|
||||
# Default prompts are configured via backend during deployment
|
||||
|
||||
@@ -37,8 +37,8 @@ from onyx.db.models import UserFile
|
||||
from onyx.db.models import UserFolder
|
||||
from onyx.db.models import UserGroup
|
||||
from onyx.db.notification import create_notification
|
||||
from onyx.server.features.persona.models import FullPersonaSnapshot
|
||||
from onyx.server.features.persona.models import PersonaSharedNotificationData
|
||||
from onyx.server.features.persona.models import PersonaSnapshot
|
||||
from onyx.server.features.persona.models import PersonaUpsertRequest
|
||||
from onyx.utils.logger import setup_logger
|
||||
from onyx.utils.variable_functionality import fetch_versioned_implementation
|
||||
@@ -201,7 +201,7 @@ def create_update_persona(
|
||||
create_persona_request: PersonaUpsertRequest,
|
||||
user: User | None,
|
||||
db_session: Session,
|
||||
) -> PersonaSnapshot:
|
||||
) -> FullPersonaSnapshot:
|
||||
"""Higher level function than upsert_persona, although either is valid to use."""
|
||||
# Permission to actually use these is checked later
|
||||
|
||||
@@ -271,7 +271,7 @@ def create_update_persona(
|
||||
logger.exception("Failed to create persona")
|
||||
raise HTTPException(status_code=400, detail=str(e))
|
||||
|
||||
return PersonaSnapshot.from_model(persona)
|
||||
return FullPersonaSnapshot.from_model(persona)
|
||||
|
||||
|
||||
def update_persona_shared_users(
|
||||
|
||||
@@ -3,8 +3,9 @@ from sqlalchemy.orm import Session
|
||||
from onyx.configs.constants import KV_REINDEX_KEY
|
||||
from onyx.db.connector_credential_pair import get_connector_credential_pairs
|
||||
from onyx.db.connector_credential_pair import resync_cc_pair
|
||||
from onyx.db.document import delete_all_documents_for_connector_credential_pair
|
||||
from onyx.db.enums import IndexModelStatus
|
||||
from onyx.db.index_attempt import cancel_indexing_attempts_past_model
|
||||
from onyx.db.index_attempt import cancel_indexing_attempts_for_search_settings
|
||||
from onyx.db.index_attempt import (
|
||||
count_unique_cc_pairs_with_successful_index_attempts,
|
||||
)
|
||||
@@ -26,31 +27,49 @@ def _perform_index_swap(
|
||||
current_search_settings: SearchSettings,
|
||||
secondary_search_settings: SearchSettings,
|
||||
all_cc_pairs: list[ConnectorCredentialPair],
|
||||
cleanup_documents: bool = False,
|
||||
) -> None:
|
||||
"""Swap the indices and expire the old one."""
|
||||
current_search_settings = get_current_search_settings(db_session)
|
||||
update_search_settings_status(
|
||||
search_settings=current_search_settings,
|
||||
new_status=IndexModelStatus.PAST,
|
||||
db_session=db_session,
|
||||
)
|
||||
|
||||
update_search_settings_status(
|
||||
search_settings=secondary_search_settings,
|
||||
new_status=IndexModelStatus.PRESENT,
|
||||
db_session=db_session,
|
||||
)
|
||||
|
||||
if len(all_cc_pairs) > 0:
|
||||
kv_store = get_kv_store()
|
||||
kv_store.store(KV_REINDEX_KEY, False)
|
||||
|
||||
# Expire jobs for the now past index/embedding model
|
||||
cancel_indexing_attempts_past_model(db_session)
|
||||
cancel_indexing_attempts_for_search_settings(
|
||||
search_settings_id=current_search_settings.id,
|
||||
db_session=db_session,
|
||||
)
|
||||
|
||||
# Recount aggregates
|
||||
for cc_pair in all_cc_pairs:
|
||||
resync_cc_pair(cc_pair, db_session=db_session)
|
||||
resync_cc_pair(
|
||||
cc_pair=cc_pair,
|
||||
# sync based on the new search settings
|
||||
search_settings_id=secondary_search_settings.id,
|
||||
db_session=db_session,
|
||||
)
|
||||
|
||||
if cleanup_documents:
|
||||
# clean up all DocumentByConnectorCredentialPair / Document rows, since we're
|
||||
# doing an instant swap and no documents will exist in the new index.
|
||||
for cc_pair in all_cc_pairs:
|
||||
delete_all_documents_for_connector_credential_pair(
|
||||
db_session=db_session,
|
||||
connector_id=cc_pair.connector_id,
|
||||
credential_id=cc_pair.credential_id,
|
||||
)
|
||||
|
||||
# swap over search settings
|
||||
update_search_settings_status(
|
||||
search_settings=current_search_settings,
|
||||
new_status=IndexModelStatus.PAST,
|
||||
db_session=db_session,
|
||||
)
|
||||
update_search_settings_status(
|
||||
search_settings=secondary_search_settings,
|
||||
new_status=IndexModelStatus.PRESENT,
|
||||
db_session=db_session,
|
||||
)
|
||||
|
||||
# remove the old index from the vector db
|
||||
document_index = get_default_document_index(secondary_search_settings, None)
|
||||
@@ -88,6 +107,9 @@ def check_and_perform_index_swap(db_session: Session) -> SearchSettings | None:
|
||||
current_search_settings=current_search_settings,
|
||||
secondary_search_settings=secondary_search_settings,
|
||||
all_cc_pairs=all_cc_pairs,
|
||||
# clean up all DocumentByConnectorCredentialPair / Document rows, since we're
|
||||
# doing an instant swap.
|
||||
cleanup_documents=True,
|
||||
)
|
||||
return current_search_settings
|
||||
|
||||
|
||||
@@ -2,6 +2,7 @@ import io
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import uuid
|
||||
import zipfile
|
||||
from collections.abc import Callable
|
||||
from collections.abc import Iterator
|
||||
@@ -14,6 +15,7 @@ from pathlib import Path
|
||||
from typing import Any
|
||||
from typing import IO
|
||||
from typing import NamedTuple
|
||||
from typing import Optional
|
||||
|
||||
import chardet
|
||||
import docx # type: ignore
|
||||
@@ -568,8 +570,8 @@ def extract_text_and_images(
|
||||
|
||||
|
||||
def convert_docx_to_txt(
|
||||
file: UploadFile, file_store: FileStore, file_path: str
|
||||
) -> None:
|
||||
file: UploadFile, file_store: FileStore, file_path: Optional[str] = None
|
||||
) -> str:
|
||||
"""
|
||||
Helper to convert docx to a .txt file in the same filestore.
|
||||
"""
|
||||
@@ -581,15 +583,41 @@ def convert_docx_to_txt(
|
||||
all_paras = [p.text for p in doc.paragraphs]
|
||||
text_content = "\n".join(all_paras)
|
||||
|
||||
txt_file_path = docx_to_txt_filename(file_path)
|
||||
file_name = file.filename or f"docx_{uuid.uuid4()}"
|
||||
text_file_name = docx_to_txt_filename(file_path if file_path else file_name)
|
||||
file_store.save_file(
|
||||
file_name=txt_file_path,
|
||||
file_name=text_file_name,
|
||||
content=BytesIO(text_content.encode("utf-8")),
|
||||
display_name=file.filename,
|
||||
file_origin=FileOrigin.CONNECTOR,
|
||||
file_type="text/plain",
|
||||
)
|
||||
return text_file_name
|
||||
|
||||
|
||||
def docx_to_txt_filename(file_path: str) -> str:
|
||||
return file_path.rsplit(".", 1)[0] + ".txt"
|
||||
|
||||
|
||||
def convert_pdf_to_txt(file: UploadFile, file_store: FileStore, file_path: str) -> str:
|
||||
"""
|
||||
Helper to convert PDF to a .txt file in the same filestore.
|
||||
"""
|
||||
file.file.seek(0)
|
||||
|
||||
# Extract text from the PDF
|
||||
text_content, _, _ = read_pdf_file(file.file)
|
||||
|
||||
text_file_name = pdf_to_txt_filename(file_path)
|
||||
file_store.save_file(
|
||||
file_name=text_file_name,
|
||||
content=BytesIO(text_content.encode("utf-8")),
|
||||
display_name=file.filename,
|
||||
file_origin=FileOrigin.CONNECTOR,
|
||||
file_type="text/plain",
|
||||
)
|
||||
return text_file_name
|
||||
|
||||
|
||||
def pdf_to_txt_filename(file_path: str) -> str:
|
||||
return file_path.rsplit(".", 1)[0] + ".txt"
|
||||
|
||||
@@ -459,10 +459,6 @@ def process_image_sections(documents: list[Document]) -> list[IndexingDocument]:
|
||||
llm = get_default_llm_with_vision()
|
||||
|
||||
if not llm:
|
||||
logger.warning(
|
||||
"No vision-capable LLM available. Image sections will not be processed."
|
||||
)
|
||||
|
||||
# Even without LLM, we still convert to IndexingDocument with base Sections
|
||||
return [
|
||||
IndexingDocument(
|
||||
@@ -929,10 +925,12 @@ def index_doc_batch(
|
||||
for chunk_num, chunk in enumerate(chunks_with_embeddings)
|
||||
]
|
||||
|
||||
logger.debug(
|
||||
"Indexing the following chunks: "
|
||||
f"{[chunk.to_short_descriptor() for chunk in access_aware_chunks]}"
|
||||
)
|
||||
short_descriptor_list = [
|
||||
chunk.to_short_descriptor() for chunk in access_aware_chunks
|
||||
]
|
||||
short_descriptor_log = str(short_descriptor_list)[:1024]
|
||||
logger.debug(f"Indexing the following chunks: {short_descriptor_log}")
|
||||
|
||||
# A document will not be spread across different batches, so all the
|
||||
# documents with chunks in this set, are fully represented by the chunks
|
||||
# in this set
|
||||
|
||||
@@ -602,7 +602,7 @@ def get_max_input_tokens(
|
||||
)
|
||||
|
||||
if input_toks <= 0:
|
||||
raise RuntimeError("No tokens for input for the LLM given settings")
|
||||
return GEN_AI_MODEL_FALLBACK_MAX_TOKENS
|
||||
|
||||
return input_toks
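The change above swaps a hard failure for a fallback token budget when the computed input allowance is non-positive. A tiny guard in the same spirit; the fallback value here is an assumption, the real one comes from Onyx's model configs:

```python
GEN_AI_MODEL_FALLBACK_MAX_TOKENS = 4096  # assumed default for illustration


def max_input_tokens(model_context_window: int, reserved_output_tokens: int) -> int:
    """Never return a non-positive budget; fall back instead of raising."""
    input_toks = model_context_window - reserved_output_tokens
    if input_toks <= 0:
        return GEN_AI_MODEL_FALLBACK_MAX_TOKENS
    return input_toks
```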
|
||||
|
||||
|
||||
@@ -1,3 +1,4 @@
|
||||
import logging
|
||||
import sys
|
||||
import traceback
|
||||
from collections.abc import AsyncGenerator
|
||||
@@ -16,6 +17,7 @@ from fastapi.exceptions import RequestValidationError
|
||||
from fastapi.middleware.cors import CORSMiddleware
|
||||
from fastapi.responses import JSONResponse
|
||||
from httpx_oauth.clients.google import GoogleOAuth2
|
||||
from prometheus_fastapi_instrumentator import Instrumentator
|
||||
from sentry_sdk.integrations.fastapi import FastApiIntegration
|
||||
from sentry_sdk.integrations.starlette import StarletteIntegration
|
||||
from sqlalchemy.orm import Session
|
||||
@@ -102,6 +104,8 @@ from onyx.server.utils import BasicAuthenticationError
|
||||
from onyx.setup import setup_multitenant_onyx
|
||||
from onyx.setup import setup_onyx
|
||||
from onyx.utils.logger import setup_logger
|
||||
from onyx.utils.logger import setup_uvicorn_logger
|
||||
from onyx.utils.middleware import add_onyx_request_id_middleware
|
||||
from onyx.utils.telemetry import get_or_generate_uuid
|
||||
from onyx.utils.telemetry import optional_telemetry
|
||||
from onyx.utils.telemetry import RecordType
|
||||
@@ -116,6 +120,12 @@ from shared_configs.contextvars import CURRENT_TENANT_ID_CONTEXTVAR
|
||||
|
||||
logger = setup_logger()
|
||||
|
||||
file_handlers = [
|
||||
h for h in logger.logger.handlers if isinstance(h, logging.FileHandler)
|
||||
]
|
||||
|
||||
setup_uvicorn_logger(shared_file_handlers=file_handlers)
|
||||
|
||||
|
||||
def validation_exception_handler(request: Request, exc: Exception) -> JSONResponse:
|
||||
if not isinstance(exc, RequestValidationError):
|
||||
@@ -421,9 +431,14 @@ def get_application() -> FastAPI:
|
||||
if LOG_ENDPOINT_LATENCY:
|
||||
add_latency_logging_middleware(application, logger)
|
||||
|
||||
add_onyx_request_id_middleware(application, "API", logger)
|
||||
|
||||
# Ensure all routes have auth enabled or are explicitly marked as public
|
||||
check_router_auth(application)
|
||||
|
||||
# Initialize and instrument the app
|
||||
Instrumentator().instrument(application).expose(application)
|
||||
|
||||
return application
|
||||
|
||||
|
||||
|
||||
@@ -175,7 +175,7 @@ class EmbeddingModel:
|
||||
embeddings: list[Embedding] = []
|
||||
|
||||
def process_batch(
|
||||
batch_idx: int, text_batch: list[str]
|
||||
batch_idx: int, batch_len: int, text_batch: list[str]
|
||||
) -> tuple[int, list[Embedding]]:
|
||||
if self.callback:
|
||||
if self.callback.should_stop():
|
||||
@@ -202,8 +202,8 @@ class EmbeddingModel:
|
||||
end_time = time.time()
|
||||
|
||||
processing_time = end_time - start_time
|
||||
logger.info(
|
||||
f"Batch {batch_idx} processing time: {processing_time:.2f} seconds"
|
||||
logger.debug(
|
||||
f"EmbeddingModel.process_batch: Batch {batch_idx}/{batch_len} processing time: {processing_time:.2f} seconds"
|
||||
)
|
||||
|
||||
return batch_idx, response.embeddings
|
||||
@@ -215,7 +215,7 @@ class EmbeddingModel:
|
||||
if num_threads >= 1 and self.provider_type and len(text_batches) > 1:
|
||||
with ThreadPoolExecutor(max_workers=num_threads) as executor:
|
||||
future_to_batch = {
|
||||
executor.submit(process_batch, idx, batch): idx
|
||||
executor.submit(process_batch, idx, len(text_batches), batch): idx
|
||||
for idx, batch in enumerate(text_batches, start=1)
|
||||
}
|
||||
|
||||
@@ -238,7 +238,7 @@ class EmbeddingModel:
|
||||
else:
|
||||
# Original sequential processing
|
||||
for idx, text_batch in enumerate(text_batches, start=1):
|
||||
_, batch_embeddings = process_batch(idx, text_batch)
|
||||
_, batch_embeddings = process_batch(idx, len(text_batches), text_batch)
|
||||
embeddings.extend(batch_embeddings)
|
||||
if self.callback:
|
||||
self.callback.progress("_batch_encode_texts", 1)
|
||||
|
||||
backend/onyx/prompts/agents/dc_prompts.py (new file, 147 lines)
@@ -0,0 +1,147 @@
|
||||
# Standards
|
||||
SEPARATOR_LINE = "-------"
|
||||
SEPARATOR_LINE_LONG = "---------------"
|
||||
NO_EXTRACTION = "No extraction of knowledge graph objects was feasable."
|
||||
YES = "yes"
|
||||
NO = "no"
|
||||
DC_OBJECT_SEPARATOR = ";"
|
||||
|
||||
|
||||
DC_OBJECT_NO_BASE_DATA_EXTRACTION_PROMPT = f"""
|
||||
You are an expert in finding relevant objects/object specifications of the same type in a list of documents. \
|
||||
In this case you are interested \
|
||||
in generating: {{objects_of_interest}}.
|
||||
You should look at the documents - in no particular order! - and extract each object you find in the documents.
|
||||
{SEPARATOR_LINE}
|
||||
Here are the documents you are supposed to search through:
|
||||
--
|
||||
{{document_text}}
|
||||
{SEPARATOR_LINE}
|
||||
Here are the task instructions you should use to help you find the desired objects:
|
||||
{SEPARATOR_LINE}
|
||||
{{task}}
|
||||
{SEPARATOR_LINE}
|
||||
Here is the question that may provide critical additional context for the task:
|
||||
{SEPARATOR_LINE}
|
||||
{{question}}
|
||||
{SEPARATOR_LINE}
|
||||
Please answer the question in the following format:
|
||||
REASONING: <your reasoning for the classification> - OBJECTS: <the objects - just their names - that you found, \
|
||||
separated by ';'>
|
||||
""".strip()
|
||||
|
||||
|
||||
DC_OBJECT_WITH_BASE_DATA_EXTRACTION_PROMPT = f"""
|
||||
You are an expert in finding relevant objects/object specifications of the same type in a list of documents. \
|
||||
In this case you are interested \
|
||||
in generating: {{objects_of_interest}}.
|
||||
You should look at the provided data - in no particular order! - and extract each object you find in the documents.
|
||||
{SEPARATOR_LINE}
|
||||
Here are the data provided by the user:
|
||||
--
|
||||
{{base_data}}
|
||||
{SEPARATOR_LINE}
|
||||
Here are the task instructions you should use to help you find the desired objects:
|
||||
{SEPARATOR_LINE}
|
||||
{{task}}
|
||||
{SEPARATOR_LINE}
|
||||
Here is the request that may provide critical additional context for the task:
|
||||
{SEPARATOR_LINE}
|
||||
{{question}}
|
||||
{SEPARATOR_LINE}
|
||||
Please address the request in the following format:
|
||||
REASONING: <your reasoning for the classification> - OBJECTS: <the objects - just their names - that you found, \
|
||||
separated by ';'>
|
||||
""".strip()
|
||||
|
||||
|
||||
DC_OBJECT_SOURCE_RESEARCH_PROMPT = f"""
|
||||
Today is {{today}}. You are an expert in extracting relevant structured information from a list of documents that \
|
||||
should relate to one object. (Try to make sure that you know it relates to that one object!).
|
||||
You should look at the documents - in no particular order! - and extract the information asked for this task:
|
||||
{SEPARATOR_LINE}
|
||||
{{task}}
|
||||
{SEPARATOR_LINE}
|
||||
|
||||
Here is the user question that may provide critical additional context for the task:
|
||||
{SEPARATOR_LINE}
|
||||
{{question}}
|
||||
{SEPARATOR_LINE}
|
||||
|
||||
Here are the documents you are supposed to search through:
|
||||
--
|
||||
{{document_text}}
|
||||
{SEPARATOR_LINE}
|
||||
Note: please cite your sources inline as you generate the results! Use the format [1], etc. Infer the \
|
||||
number from the provided context documents. This is very important!
|
||||
Please address the task in the following format:
|
||||
REASONING:
|
||||
-- <your reasoning for the classification>
|
||||
RESEARCH RESULTS:
|
||||
{{format}}
|
||||
""".strip()
|
||||
|
||||
|
||||
DC_OBJECT_CONSOLIDATION_PROMPT = f"""
|
||||
You are a helpful assistant that consolidates information about a specific object \
|
||||
from multiple sources.
|
||||
The object is:
|
||||
{SEPARATOR_LINE}
|
||||
{{object}}
|
||||
{SEPARATOR_LINE}
|
||||
and the information is
|
||||
{SEPARATOR_LINE}
|
||||
{{information}}
|
||||
{SEPARATOR_LINE}
|
||||
Here is the user question that may provide critical additional context for the task:
|
||||
{SEPARATOR_LINE}
|
||||
{{question}}
|
||||
{SEPARATOR_LINE}
|
||||
|
||||
Please consolidate the information into a single, concise answer. The consolidated information \
|
||||
for the object should be in the following format:
|
||||
{SEPARATOR_LINE}
|
||||
{{format}}
|
||||
{SEPARATOR_LINE}
|
||||
Overall, please use this structure to communicate the consolidated information:
|
||||
{SEPARATOR_LINE}
|
||||
REASONING: <your reasoning for consolidating the information>
|
||||
INFORMATION:
|
||||
<consolidated information in the proper format that you have created>
|
||||
"""
|
||||
|
||||
|
||||
DC_FORMATTING_NO_BASE_DATA_PROMPT = f"""
|
||||
You are an expert in text formatting. Your task is to take a given text and convert it 100 percent accurately \
|
||||
into a new format.
|
||||
Here is the text you are supposed to format:
|
||||
{SEPARATOR_LINE}
|
||||
{{text}}
|
||||
{SEPARATOR_LINE}
|
||||
Here is the format you are supposed to use:
|
||||
{SEPARATOR_LINE}
|
||||
{{format}}
|
||||
{SEPARATOR_LINE}
|
||||
Please start the generation directly with the formatted text. (Note that the output should not be code, but text.)
|
||||
"""
|
||||
|
||||
DC_FORMATTING_WITH_BASE_DATA_PROMPT = f"""
|
||||
You are an expert in text formatting. Your task is to take a given text and the initial \
|
||||
base data provided by the user, and convert it 100 percent accurately \
|
||||
into a new format. The base data may also contain important relationships that are critical \
|
||||
for the formatting.
|
||||
Here is the initial data provided by the user:
|
||||
{SEPARATOR_LINE}
|
||||
{{base_data}}
|
||||
{SEPARATOR_LINE}
|
||||
Here is the text you are supposed to combine (and format) with the initial data, adhering to the \
format instructions provided later in the prompt:
|
||||
{SEPARATOR_LINE}
|
||||
{{text}}
|
||||
{SEPARATOR_LINE}
|
||||
And here are the format instructions you are supposed to use:
|
||||
{SEPARATOR_LINE}
|
||||
{{format}}
|
||||
{SEPARATOR_LINE}
|
||||
Please start the generation directly with the formatted text. (Note that the output should not be code, but text.)
|
||||
"""
|
||||
@@ -49,6 +49,7 @@ PUBLIC_ENDPOINT_SPECS = [
|
||||
("/auth/oauth/callback", {"GET"}),
|
||||
# anonymous user on cloud
|
||||
("/tenants/anonymous-user", {"POST"}),
|
||||
("/metrics", {"GET"}), # added by prometheus_fastapi_instrumentator
|
||||
]
|
||||
|
||||
|
||||
|
||||
@@ -21,7 +21,7 @@ from onyx.background.celery.tasks.external_group_syncing.tasks import (
|
||||
from onyx.background.celery.tasks.pruning.tasks import (
|
||||
try_creating_prune_generator_task,
|
||||
)
|
||||
from onyx.background.celery.versioned_apps.primary import app as primary_app
|
||||
from onyx.background.celery.versioned_apps.client import app as client_app
|
||||
from onyx.background.indexing.models import IndexAttemptErrorPydantic
|
||||
from onyx.configs.constants import OnyxCeleryPriority
|
||||
from onyx.configs.constants import OnyxCeleryTask
|
||||
@@ -219,7 +219,7 @@ def update_cc_pair_status(
|
||||
continue
|
||||
|
||||
# Revoke the task to prevent it from running
|
||||
primary_app.control.revoke(index_payload.celery_task_id)
|
||||
client_app.control.revoke(index_payload.celery_task_id)
|
||||
|
||||
# If it is running, then signaling for termination will get the
|
||||
# watchdog thread to kill the spawned task
|
||||
@@ -238,7 +238,7 @@ def update_cc_pair_status(
|
||||
db_session.commit()
|
||||
|
||||
# this speeds up the start of indexing by firing the check immediately
|
||||
primary_app.send_task(
|
||||
client_app.send_task(
|
||||
OnyxCeleryTask.CHECK_FOR_INDEXING,
|
||||
kwargs=dict(tenant_id=tenant_id),
|
||||
priority=OnyxCeleryPriority.HIGH,
|
||||
@@ -376,7 +376,7 @@ def prune_cc_pair(
|
||||
f"{cc_pair.connector.name} connector."
|
||||
)
|
||||
payload_id = try_creating_prune_generator_task(
|
||||
primary_app, cc_pair, db_session, r, tenant_id
|
||||
client_app, cc_pair, db_session, r, tenant_id
|
||||
)
|
||||
if not payload_id:
|
||||
raise HTTPException(
|
||||
@@ -450,7 +450,7 @@ def sync_cc_pair(
|
||||
f"{cc_pair.connector.name} connector."
|
||||
)
|
||||
payload_id = try_creating_permissions_sync_task(
|
||||
primary_app, cc_pair_id, r, tenant_id
|
||||
client_app, cc_pair_id, r, tenant_id
|
||||
)
|
||||
if not payload_id:
|
||||
raise HTTPException(
|
||||
@@ -524,7 +524,7 @@ def sync_cc_pair_groups(
|
||||
f"{cc_pair.connector.name} connector."
|
||||
)
|
||||
payload_id = try_creating_external_group_sync_task(
|
||||
primary_app, cc_pair_id, r, tenant_id
|
||||
client_app, cc_pair_id, r, tenant_id
|
||||
)
|
||||
if not payload_id:
|
||||
raise HTTPException(
|
||||
@@ -634,7 +634,7 @@ def associate_credential_to_connector(
|
||||
)
|
||||
|
||||
# trigger indexing immediately
|
||||
primary_app.send_task(
|
||||
client_app.send_task(
|
||||
OnyxCeleryTask.CHECK_FOR_INDEXING,
|
||||
priority=OnyxCeleryPriority.HIGH,
|
||||
kwargs={"tenant_id": tenant_id},
|
||||
|
||||
@@ -20,7 +20,7 @@ from onyx.auth.users import current_admin_user
|
||||
from onyx.auth.users import current_chat_accessible_user
|
||||
from onyx.auth.users import current_curator_or_admin_user
|
||||
from onyx.auth.users import current_user
|
||||
from onyx.background.celery.versioned_apps.primary import app as primary_app
|
||||
from onyx.background.celery.versioned_apps.client import app as client_app
|
||||
from onyx.configs.app_configs import ENABLED_CONNECTOR_TYPES
|
||||
from onyx.configs.app_configs import MOCK_CONNECTOR_FILE_PATH
|
||||
from onyx.configs.constants import DocumentSource
|
||||
@@ -100,6 +100,7 @@ from onyx.db.models import UserGroup__ConnectorCredentialPair
|
||||
from onyx.db.search_settings import get_current_search_settings
|
||||
from onyx.db.search_settings import get_secondary_search_settings
|
||||
from onyx.file_processing.extract_file_text import convert_docx_to_txt
|
||||
from onyx.file_processing.extract_file_text import convert_pdf_to_txt
|
||||
from onyx.file_store.file_store import get_default_file_store
|
||||
from onyx.key_value_store.interface import KvKeyNotFoundError
|
||||
from onyx.redis.redis_connector import RedisConnector
|
||||
@@ -128,6 +129,7 @@ from onyx.utils.telemetry import create_milestone_and_report
|
||||
from onyx.utils.threadpool_concurrency import run_functions_tuples_in_parallel
|
||||
from onyx.utils.variable_functionality import fetch_ee_implementation_or_noop
|
||||
|
||||
|
||||
logger = setup_logger()
|
||||
|
||||
_GMAIL_CREDENTIAL_ID_COOKIE_NAME = "gmail_credential_id"
|
||||
@@ -430,6 +432,23 @@ def upload_files(files: list[UploadFile], db_session: Session) -> FileUploadResp
|
||||
)
|
||||
continue
|
||||
|
||||
# Special handling for docx files - only store the plaintext version
|
||||
if file.content_type and file.content_type.startswith(
|
||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document"
|
||||
):
|
||||
file_path = os.path.join(str(uuid.uuid4()), cast(str, file.filename))
|
||||
text_file_path = convert_docx_to_txt(file, file_store)
|
||||
deduped_file_paths.append(text_file_path)
|
||||
continue
|
||||
|
||||
# Special handling for PDF files - only store the plaintext version
|
||||
if file.content_type and file.content_type.startswith("application/pdf"):
|
||||
file_path = os.path.join(str(uuid.uuid4()), cast(str, file.filename))
|
||||
text_file_path = convert_pdf_to_txt(file, file_store, file_path)
|
||||
deduped_file_paths.append(text_file_path)
|
||||
continue
|
||||
|
||||
# Default handling for all other file types
|
||||
file_path = os.path.join(str(uuid.uuid4()), cast(str, file.filename))
|
||||
deduped_file_paths.append(file_path)
|
||||
file_store.save_file(
|
||||
@@ -440,11 +459,6 @@ def upload_files(files: list[UploadFile], db_session: Session) -> FileUploadResp
|
||||
file_type=file.content_type or "text/plain",
|
||||
)
|
||||
|
||||
if file.content_type and file.content_type.startswith(
|
||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document"
|
||||
):
|
||||
convert_docx_to_txt(file, file_store, file_path)
|
||||
|
||||
except ValueError as e:
|
||||
raise HTTPException(status_code=400, detail=str(e))
|
||||
return FileUploadResponse(file_paths=deduped_file_paths)
|
||||
@@ -928,7 +942,7 @@ def create_connector_with_mock_credential(
|
||||
)
|
||||
|
||||
# trigger indexing immediately
|
||||
primary_app.send_task(
|
||||
client_app.send_task(
|
||||
OnyxCeleryTask.CHECK_FOR_INDEXING,
|
||||
priority=OnyxCeleryPriority.HIGH,
|
||||
kwargs={"tenant_id": tenant_id},
|
||||
@@ -1314,7 +1328,7 @@ def trigger_indexing_for_cc_pair(
|
||||
# run the beat task to pick up the triggers immediately
|
||||
priority = OnyxCeleryPriority.HIGHEST if is_user_file else OnyxCeleryPriority.HIGH
|
||||
logger.info(f"Sending indexing check task with priority {priority}")
|
||||
primary_app.send_task(
|
||||
client_app.send_task(
|
||||
OnyxCeleryTask.CHECK_FOR_INDEXING,
|
||||
priority=priority,
|
||||
kwargs={"tenant_id": tenant_id},
|
||||
|
||||
@@ -6,7 +6,7 @@ from sqlalchemy.orm import Session
|
||||
|
||||
from onyx.auth.users import current_curator_or_admin_user
|
||||
from onyx.auth.users import current_user
|
||||
from onyx.background.celery.versioned_apps.primary import app as primary_app
|
||||
from onyx.background.celery.versioned_apps.client import app as client_app
|
||||
from onyx.configs.constants import OnyxCeleryPriority
|
||||
from onyx.configs.constants import OnyxCeleryTask
|
||||
from onyx.db.document_set import check_document_sets_are_public
|
||||
@@ -52,7 +52,7 @@ def create_document_set(
|
||||
except Exception as e:
|
||||
raise HTTPException(status_code=400, detail=str(e))
|
||||
|
||||
primary_app.send_task(
|
||||
client_app.send_task(
|
||||
OnyxCeleryTask.CHECK_FOR_VESPA_SYNC_TASK,
|
||||
kwargs={"tenant_id": tenant_id},
|
||||
priority=OnyxCeleryPriority.HIGH,
|
||||
@@ -85,7 +85,7 @@ def patch_document_set(
|
||||
except Exception as e:
|
||||
raise HTTPException(status_code=400, detail=str(e))
|
||||
|
||||
primary_app.send_task(
|
||||
client_app.send_task(
|
||||
OnyxCeleryTask.CHECK_FOR_VESPA_SYNC_TASK,
|
||||
kwargs={"tenant_id": tenant_id},
|
||||
priority=OnyxCeleryPriority.HIGH,
|
||||
@@ -108,7 +108,7 @@ def delete_document_set(
|
||||
except Exception as e:
|
||||
raise HTTPException(status_code=400, detail=str(e))
|
||||
|
||||
primary_app.send_task(
|
||||
client_app.send_task(
|
||||
OnyxCeleryTask.CHECK_FOR_VESPA_SYNC_TASK,
|
||||
kwargs={"tenant_id": tenant_id},
|
||||
priority=OnyxCeleryPriority.HIGH,
|
||||
|
||||
@@ -43,6 +43,7 @@ from onyx.file_store.models import ChatFileType
|
||||
from onyx.secondary_llm_flows.starter_message_creation import (
|
||||
generate_starter_messages,
|
||||
)
|
||||
from onyx.server.features.persona.models import FullPersonaSnapshot
|
||||
from onyx.server.features.persona.models import GenerateStarterMessageRequest
|
||||
from onyx.server.features.persona.models import ImageGenerationToolStatus
|
||||
from onyx.server.features.persona.models import PersonaLabelCreate
|
||||
@@ -424,8 +425,8 @@ def get_persona(
|
||||
persona_id: int,
|
||||
user: User | None = Depends(current_limited_user),
|
||||
db_session: Session = Depends(get_session),
|
||||
) -> PersonaSnapshot:
|
||||
return PersonaSnapshot.from_model(
|
||||
) -> FullPersonaSnapshot:
|
||||
return FullPersonaSnapshot.from_model(
|
||||
get_persona_by_id(
|
||||
persona_id=persona_id,
|
||||
user=user,
|
||||
|
||||
@@ -91,37 +91,80 @@ class PersonaUpsertRequest(BaseModel):
|
||||
|
||||
class PersonaSnapshot(BaseModel):
|
||||
id: int
|
||||
owner: MinimalUserSnapshot | None
|
||||
name: str
|
||||
is_visible: bool
|
||||
is_public: bool
|
||||
display_priority: int | None
|
||||
description: str
|
||||
num_chunks: float | None
|
||||
llm_relevance_filter: bool
|
||||
llm_filter_extraction: bool
|
||||
llm_model_provider_override: str | None
|
||||
llm_model_version_override: str | None
|
||||
starter_messages: list[StarterMessage] | None
|
||||
builtin_persona: bool
|
||||
prompts: list[PromptSnapshot]
|
||||
tools: list[ToolSnapshot]
|
||||
document_sets: list[DocumentSet]
|
||||
users: list[MinimalUserSnapshot]
|
||||
groups: list[int]
|
||||
icon_color: str | None
|
||||
icon_shape: int | None
|
||||
is_public: bool
|
||||
is_visible: bool
|
||||
icon_shape: int | None = None
|
||||
icon_color: str | None = None
|
||||
uploaded_image_id: str | None = None
|
||||
is_default_persona: bool
|
||||
user_file_ids: list[int] = Field(default_factory=list)
|
||||
user_folder_ids: list[int] = Field(default_factory=list)
|
||||
display_priority: int | None = None
|
||||
is_default_persona: bool = False
|
||||
builtin_persona: bool = False
|
||||
starter_messages: list[StarterMessage] | None = None
|
||||
tools: list[ToolSnapshot] = Field(default_factory=list)
|
||||
labels: list["PersonaLabelSnapshot"] = Field(default_factory=list)
|
||||
owner: MinimalUserSnapshot | None = None
|
||||
users: list[MinimalUserSnapshot] = Field(default_factory=list)
|
||||
groups: list[int] = Field(default_factory=list)
|
||||
document_sets: list[DocumentSet] = Field(default_factory=list)
|
||||
llm_model_provider_override: str | None = None
|
||||
llm_model_version_override: str | None = None
|
||||
num_chunks: float | None = None
|
||||
|
||||
@classmethod
|
||||
def from_model(cls, persona: Persona) -> "PersonaSnapshot":
|
||||
return PersonaSnapshot(
|
||||
id=persona.id,
|
||||
name=persona.name,
|
||||
description=persona.description,
|
||||
is_public=persona.is_public,
|
||||
is_visible=persona.is_visible,
|
||||
icon_shape=persona.icon_shape,
|
||||
icon_color=persona.icon_color,
|
||||
uploaded_image_id=persona.uploaded_image_id,
|
||||
user_file_ids=[file.id for file in persona.user_files],
|
||||
user_folder_ids=[folder.id for folder in persona.user_folders],
|
||||
display_priority=persona.display_priority,
|
||||
is_default_persona=persona.is_default_persona,
|
||||
builtin_persona=persona.builtin_persona,
|
||||
starter_messages=persona.starter_messages,
|
||||
tools=[ToolSnapshot.from_model(tool) for tool in persona.tools],
|
||||
labels=[PersonaLabelSnapshot.from_model(label) for label in persona.labels],
|
||||
owner=(
|
||||
MinimalUserSnapshot(id=persona.user.id, email=persona.user.email)
|
||||
if persona.user
|
||||
else None
|
||||
),
|
||||
users=[
|
||||
MinimalUserSnapshot(id=user.id, email=user.email)
|
||||
for user in persona.users
|
||||
],
|
||||
groups=[user_group.id for user_group in persona.groups],
|
||||
document_sets=[
|
||||
DocumentSet.from_model(document_set_model)
|
||||
for document_set_model in persona.document_sets
|
||||
],
|
||||
llm_model_provider_override=persona.llm_model_provider_override,
|
||||
llm_model_version_override=persona.llm_model_version_override,
|
||||
num_chunks=persona.num_chunks,
|
||||
)
|
||||
|
||||
|
||||
# Model with full context on persona's internal settings
|
||||
# This is used for flows which need to know all settings
|
||||
class FullPersonaSnapshot(PersonaSnapshot):
|
||||
search_start_date: datetime | None = None
|
||||
labels: list["PersonaLabelSnapshot"] = []
|
||||
user_file_ids: list[int] | None = None
|
||||
user_folder_ids: list[int] | None = None
|
||||
prompts: list[PromptSnapshot] = Field(default_factory=list)
|
||||
llm_relevance_filter: bool = False
|
||||
llm_filter_extraction: bool = False
|
||||
|
||||
@classmethod
|
||||
def from_model(
|
||||
cls, persona: Persona, allow_deleted: bool = False
|
||||
) -> "PersonaSnapshot":
|
||||
) -> "FullPersonaSnapshot":
|
||||
if persona.deleted:
|
||||
error_msg = f"Persona with ID {persona.id} has been deleted"
|
||||
if not allow_deleted:
|
||||
@@ -129,44 +172,32 @@ class PersonaSnapshot(BaseModel):
|
||||
else:
|
||||
logger.warning(error_msg)
|
||||
|
||||
return PersonaSnapshot(
|
||||
return FullPersonaSnapshot(
|
||||
id=persona.id,
|
||||
name=persona.name,
|
||||
description=persona.description,
|
||||
is_public=persona.is_public,
|
||||
is_visible=persona.is_visible,
|
||||
icon_shape=persona.icon_shape,
|
||||
icon_color=persona.icon_color,
|
||||
uploaded_image_id=persona.uploaded_image_id,
|
||||
user_file_ids=[file.id for file in persona.user_files],
|
||||
user_folder_ids=[folder.id for folder in persona.user_folders],
|
||||
display_priority=persona.display_priority,
|
||||
is_default_persona=persona.is_default_persona,
|
||||
builtin_persona=persona.builtin_persona,
|
||||
starter_messages=persona.starter_messages,
|
||||
tools=[ToolSnapshot.from_model(tool) for tool in persona.tools],
|
||||
labels=[PersonaLabelSnapshot.from_model(label) for label in persona.labels],
|
||||
owner=(
|
||||
MinimalUserSnapshot(id=persona.user.id, email=persona.user.email)
|
||||
if persona.user
|
||||
else None
|
||||
),
|
||||
is_visible=persona.is_visible,
|
||||
is_public=persona.is_public,
|
||||
display_priority=persona.display_priority,
|
||||
description=persona.description,
|
||||
num_chunks=persona.num_chunks,
|
||||
search_start_date=persona.search_start_date,
|
||||
prompts=[PromptSnapshot.from_model(prompt) for prompt in persona.prompts],
|
||||
llm_relevance_filter=persona.llm_relevance_filter,
|
||||
llm_filter_extraction=persona.llm_filter_extraction,
|
||||
llm_model_provider_override=persona.llm_model_provider_override,
|
||||
llm_model_version_override=persona.llm_model_version_override,
|
||||
starter_messages=persona.starter_messages,
|
||||
builtin_persona=persona.builtin_persona,
|
||||
is_default_persona=persona.is_default_persona,
|
||||
prompts=[PromptSnapshot.from_model(prompt) for prompt in persona.prompts],
|
||||
tools=[ToolSnapshot.from_model(tool) for tool in persona.tools],
|
||||
document_sets=[
|
||||
DocumentSet.from_model(document_set_model)
|
||||
for document_set_model in persona.document_sets
|
||||
],
|
||||
users=[
|
||||
MinimalUserSnapshot(id=user.id, email=user.email)
|
||||
for user in persona.users
|
||||
],
|
||||
groups=[user_group.id for user_group in persona.groups],
|
||||
icon_color=persona.icon_color,
|
||||
icon_shape=persona.icon_shape,
|
||||
uploaded_image_id=persona.uploaded_image_id,
|
||||
search_start_date=persona.search_start_date,
|
||||
labels=[PersonaLabelSnapshot.from_model(label) for label in persona.labels],
|
||||
user_file_ids=[file.id for file in persona.user_files],
|
||||
user_folder_ids=[folder.id for folder in persona.user_folders],
|
||||
)
|
||||
|
||||
|
||||
|
||||
@@ -10,7 +10,7 @@ from sqlalchemy.orm import Session
|
||||
|
||||
from onyx.auth.users import current_admin_user
|
||||
from onyx.auth.users import current_curator_or_admin_user
|
||||
from onyx.background.celery.versioned_apps.primary import app as primary_app
|
||||
from onyx.background.celery.versioned_apps.client import app as client_app
|
||||
from onyx.configs.app_configs import GENERATIVE_MODEL_ACCESS_CHECK_FREQ
|
||||
from onyx.configs.constants import DocumentSource
|
||||
from onyx.configs.constants import KV_GEN_AI_KEY_CHECK_TIME
|
||||
@@ -192,7 +192,7 @@ def create_deletion_attempt_for_connector_id(
|
||||
db_session.commit()
|
||||
|
||||
# run the beat task to pick up this deletion from the db immediately
|
||||
primary_app.send_task(
|
||||
client_app.send_task(
|
||||
OnyxCeleryTask.CHECK_FOR_CONNECTOR_DELETION,
|
||||
priority=OnyxCeleryPriority.HIGH,
|
||||
kwargs={"tenant_id": tenant_id},
|
||||
|
||||
@@ -19,6 +19,7 @@ from onyx.db.models import SlackBot as SlackAppModel
|
||||
from onyx.db.models import SlackChannelConfig as SlackChannelConfigModel
|
||||
from onyx.db.models import User
|
||||
from onyx.onyxbot.slack.config import VALID_SLACK_FILTERS
|
||||
from onyx.server.features.persona.models import FullPersonaSnapshot
|
||||
from onyx.server.features.persona.models import PersonaSnapshot
|
||||
from onyx.server.models import FullUserSnapshot
|
||||
from onyx.server.models import InvitedUserSnapshot
|
||||
@@ -245,7 +246,7 @@ class SlackChannelConfig(BaseModel):
|
||||
id=slack_channel_config_model.id,
|
||||
slack_bot_id=slack_channel_config_model.slack_bot_id,
|
||||
persona=(
|
||||
PersonaSnapshot.from_model(
|
||||
FullPersonaSnapshot.from_model(
|
||||
slack_channel_config_model.persona, allow_deleted=True
|
||||
)
|
||||
if slack_channel_config_model.persona
|
||||
|
||||
@@ -117,7 +117,11 @@ def set_new_search_settings(
|
||||
search_settings_id=search_settings.id, db_session=db_session
|
||||
)
|
||||
for cc_pair in get_connector_credential_pairs(db_session):
|
||||
resync_cc_pair(cc_pair, db_session=db_session)
|
||||
resync_cc_pair(
|
||||
cc_pair=cc_pair,
|
||||
search_settings_id=new_search_settings.id,
|
||||
db_session=db_session,
|
||||
)
|
||||
|
||||
db_session.commit()
|
||||
return IdReturn(id=new_search_settings.id)
|
||||
|
||||
@@ -96,7 +96,11 @@ def setup_onyx(
|
||||
)
|
||||
|
||||
for cc_pair in get_connector_credential_pairs(db_session):
|
||||
resync_cc_pair(cc_pair, db_session=db_session)
|
||||
resync_cc_pair(
|
||||
cc_pair=cc_pair,
|
||||
search_settings_id=search_settings.id,
|
||||
db_session=db_session,
|
||||
)
|
||||
|
||||
# Expire all old embedding models indexing attempts, technically redundant
|
||||
cancel_indexing_attempts_past_model(db_session)
|
||||
|
||||
@@ -1,4 +1,5 @@
|
||||
from collections.abc import Callable
|
||||
from datetime import datetime
|
||||
from typing import Any
|
||||
from uuid import UUID
|
||||
|
||||
@@ -6,6 +7,7 @@ from pydantic import BaseModel
|
||||
from pydantic import model_validator
|
||||
from sqlalchemy.orm import Session
|
||||
|
||||
from onyx.configs.constants import DocumentSource
|
||||
from onyx.context.search.enums import SearchType
|
||||
from onyx.context.search.models import IndexFilters
|
||||
from onyx.context.search.models import InferenceSection
|
||||
@@ -75,6 +77,8 @@ class SearchToolOverrideKwargs(BaseModel):
|
||||
ordering_only: bool | None = (
|
||||
None # Flag for fast path when search is only needed for ordering
|
||||
)
|
||||
document_sources: list[DocumentSource] | None = None
|
||||
time_cutoff: datetime | None = None
|
||||
|
||||
class Config:
|
||||
arbitrary_types_allowed = True
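For reference, a hypothetical caller-side sketch (not from these commits; the import path of SearchToolOverrideKwargs is an assumption) showing how the two new fields could scope a search:

    from datetime import datetime
    from datetime import timezone

    from onyx.configs.constants import DocumentSource
    # Import path assumed; SearchToolOverrideKwargs is defined alongside SearchTool.
    from onyx.tools.tool_implementations.search.search_tool import SearchToolOverrideKwargs

    # Restrict retrieval to Salesforce documents newer than Jan 1, 2025.
    override_kwargs = SearchToolOverrideKwargs(
        document_sources=[DocumentSource.SALESFORCE],
        time_cutoff=datetime(2025, 1, 1, tzinfo=timezone.utc),
    )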
|
||||
|
||||
@@ -292,6 +292,8 @@ class SearchTool(Tool[SearchToolOverrideKwargs]):
|
||||
user_file_ids = None
|
||||
user_folder_ids = None
|
||||
ordering_only = False
|
||||
document_sources = None
|
||||
time_cutoff = None
|
||||
if override_kwargs:
|
||||
force_no_rerank = use_alt_not_None(override_kwargs.force_no_rerank, False)
|
||||
alternate_db_session = override_kwargs.alternate_db_session
|
||||
@@ -302,6 +304,8 @@ class SearchTool(Tool[SearchToolOverrideKwargs]):
|
||||
user_file_ids = override_kwargs.user_file_ids
|
||||
user_folder_ids = override_kwargs.user_folder_ids
|
||||
ordering_only = use_alt_not_None(override_kwargs.ordering_only, False)
|
||||
document_sources = override_kwargs.document_sources
|
||||
time_cutoff = override_kwargs.time_cutoff
|
||||
|
||||
# Fast path for ordering-only search
|
||||
if ordering_only:
|
||||
@@ -334,6 +338,23 @@ class SearchTool(Tool[SearchToolOverrideKwargs]):
|
||||
)
|
||||
retrieval_options = RetrievalDetails(filters=filters)
|
||||
|
||||
if document_sources or time_cutoff:
|
||||
# Get retrieval_options and filters, or create if they don't exist
|
||||
retrieval_options = retrieval_options or RetrievalDetails()
|
||||
retrieval_options.filters = retrieval_options.filters or BaseFilters()
|
||||
|
||||
# Handle document sources
|
||||
if document_sources:
|
||||
source_types = retrieval_options.filters.source_type or []
|
||||
retrieval_options.filters.source_type = list(
|
||||
set(source_types + document_sources)
|
||||
)
|
||||
|
||||
# Handle time cutoff
|
||||
if time_cutoff:
|
||||
# Overwrite time-cutoff should supersede existing time-cutoff, even if defined
|
||||
retrieval_options.filters.time_cutoff = time_cutoff
|
||||
|
||||
search_pipeline = SearchPipeline(
|
||||
search_request=SearchRequest(
|
||||
query=query,
|
||||
@@ -376,6 +397,7 @@ class SearchTool(Tool[SearchToolOverrideKwargs]):
|
||||
db_session=alternate_db_session or self.db_session,
|
||||
prompt_config=self.prompt_config,
|
||||
retrieved_sections_callback=retrieved_sections_callback,
|
||||
contextual_pruning_config=self.contextual_pruning_config,
|
||||
)
|
||||
|
||||
search_query_info = SearchQueryInfo(
|
||||
@@ -447,6 +469,7 @@ class SearchTool(Tool[SearchToolOverrideKwargs]):
|
||||
db_session=self.db_session,
|
||||
bypass_acl=self.bypass_acl,
|
||||
prompt_config=self.prompt_config,
|
||||
contextual_pruning_config=self.contextual_pruning_config,
|
||||
)
|
||||
|
||||
# Log what we're doing
|
||||
|
||||
@@ -13,6 +13,7 @@ from shared_configs.configs import POSTGRES_DEFAULT_SCHEMA
|
||||
from shared_configs.configs import SLACK_CHANNEL_ID
|
||||
from shared_configs.configs import TENANT_ID_PREFIX
|
||||
from shared_configs.contextvars import CURRENT_TENANT_ID_CONTEXTVAR
|
||||
from shared_configs.contextvars import ONYX_REQUEST_ID_CONTEXTVAR
|
||||
|
||||
|
||||
logging.addLevelName(logging.INFO + 5, "NOTICE")
|
||||
@@ -71,6 +72,14 @@ def get_log_level_from_str(log_level_str: str = LOG_LEVEL) -> int:
|
||||
return log_level_dict.get(log_level_str.upper(), logging.getLevelName("NOTICE"))
|
||||
|
||||
|
||||
class OnyxRequestIDFilter(logging.Filter):
|
||||
def filter(self, record: logging.LogRecord) -> bool:
|
||||
from shared_configs.contextvars import ONYX_REQUEST_ID_CONTEXTVAR
|
||||
|
||||
record.request_id = ONYX_REQUEST_ID_CONTEXTVAR.get() or "-"
|
||||
return True
|
||||
|
||||
|
||||
class OnyxLoggingAdapter(logging.LoggerAdapter):
|
||||
def process(
|
||||
self, msg: str, kwargs: MutableMapping[str, Any]
|
||||
@@ -103,6 +112,7 @@ class OnyxLoggingAdapter(logging.LoggerAdapter):
|
||||
msg = f"[CC Pair: {cc_pair_id}] {msg}"
|
||||
|
||||
break
|
||||
|
||||
# Add tenant information if it differs from default
|
||||
# This will always be the case for authenticated API requests
|
||||
if MULTI_TENANT:
|
||||
@@ -115,6 +125,11 @@ class OnyxLoggingAdapter(logging.LoggerAdapter):
|
||||
)
|
||||
msg = f"[t:{short_tenant}] {msg}"
|
||||
|
||||
# request id within a fastapi route
|
||||
fastapi_request_id = ONYX_REQUEST_ID_CONTEXTVAR.get()
|
||||
if fastapi_request_id:
|
||||
msg = f"[{fastapi_request_id}] {msg}"
|
||||
|
||||
# For Slack Bot, logs the channel relevant to the request
|
||||
channel_id = self.extra.get(SLACK_CHANNEL_ID) if self.extra else None
|
||||
if channel_id:
|
||||
@@ -165,6 +180,14 @@ class ColoredFormatter(logging.Formatter):
|
||||
return super().format(record)
|
||||
|
||||
|
||||
def get_uvicorn_standard_formatter() -> ColoredFormatter:
|
||||
"""Returns a standard colored logging formatter."""
|
||||
return ColoredFormatter(
|
||||
"%(asctime)s %(filename)30s %(lineno)4s: [%(request_id)s] %(message)s",
|
||||
datefmt="%m/%d/%Y %I:%M:%S %p",
|
||||
)
|
||||
|
||||
|
||||
def get_standard_formatter() -> ColoredFormatter:
|
||||
"""Returns a standard colored logging formatter."""
|
||||
return ColoredFormatter(
|
||||
@@ -201,12 +224,6 @@ def setup_logger(
|
||||
|
||||
logger.addHandler(handler)
|
||||
|
||||
uvicorn_logger = logging.getLogger("uvicorn.access")
|
||||
if uvicorn_logger:
|
||||
uvicorn_logger.handlers = []
|
||||
uvicorn_logger.addHandler(handler)
|
||||
uvicorn_logger.setLevel(log_level)
|
||||
|
||||
is_containerized = is_running_in_container()
|
||||
if LOG_FILE_NAME and (is_containerized or DEV_LOGGING_ENABLED):
|
||||
log_levels = ["debug", "info", "notice"]
|
||||
@@ -225,14 +242,37 @@ def setup_logger(
|
||||
file_handler.setFormatter(formatter)
|
||||
logger.addHandler(file_handler)
|
||||
|
||||
if uvicorn_logger:
|
||||
uvicorn_logger.addHandler(file_handler)
|
||||
|
||||
logger.notice = lambda msg, *args, **kwargs: logger.log(logging.getLevelName("NOTICE"), msg, *args, **kwargs) # type: ignore
|
||||
|
||||
return OnyxLoggingAdapter(logger, extra=extra)
|
||||
|
||||
|
||||
def setup_uvicorn_logger(
|
||||
log_level: int = get_log_level_from_str(),
|
||||
shared_file_handlers: list[logging.FileHandler] | None = None,
|
||||
) -> None:
|
||||
uvicorn_logger = logging.getLogger("uvicorn.access")
|
||||
if not uvicorn_logger:
|
||||
return
|
||||
|
||||
formatter = get_uvicorn_standard_formatter()
|
||||
|
||||
handler = logging.StreamHandler()
|
||||
handler.setLevel(log_level)
|
||||
handler.setFormatter(formatter)
|
||||
|
||||
uvicorn_logger.handlers = []
|
||||
uvicorn_logger.addHandler(handler)
|
||||
uvicorn_logger.setLevel(log_level)
|
||||
uvicorn_logger.addFilter(OnyxRequestIDFilter())
|
||||
|
||||
if shared_file_handlers:
|
||||
for fh in shared_file_handlers:
|
||||
uvicorn_logger.addHandler(fh)
|
||||
|
||||
return
|
||||
|
||||
|
||||
def print_loggers() -> None:
|
||||
"""Print information about all loggers. Use to debug logging issues."""
|
||||
root_logger = logging.getLogger()
|
||||
|
||||
backend/onyx/utils/middleware.py (new file, 62 lines)
@@ -0,0 +1,62 @@
import base64
import hashlib
import logging
import uuid
from collections.abc import Awaitable
from collections.abc import Callable
from datetime import datetime
from datetime import timezone

from fastapi import FastAPI
from fastapi import Request
from fastapi import Response

from shared_configs.contextvars import ONYX_REQUEST_ID_CONTEXTVAR


def add_onyx_request_id_middleware(
    app: FastAPI, prefix: str, logger: logging.LoggerAdapter
) -> None:
    @app.middleware("http")
    async def set_request_id(
        request: Request, call_next: Callable[[Request], Awaitable[Response]]
    ) -> Response:
        """Generate a request hash that can be used to track the lifecycle
        of a request. The hash is prefixed to help indicate where the request id
        originated.

        Format is f"{PREFIX}:{ID}" where PREFIX is 3 chars and ID is 8 chars.
        Total length is 12 chars.
        """

        onyx_request_id = request.headers.get("X-Onyx-Request-ID")
        if not onyx_request_id:
            onyx_request_id = make_randomized_onyx_request_id(prefix)

        ONYX_REQUEST_ID_CONTEXTVAR.set(onyx_request_id)
        return await call_next(request)


def make_randomized_onyx_request_id(prefix: str) -> str:
    """generates a randomized request id"""

    hash_input = str(uuid.uuid4())
    return _make_onyx_request_id(prefix, hash_input)


def make_structured_onyx_request_id(prefix: str, request_url: str) -> str:
    """Not used yet, but could be in the future!"""
    hash_input = f"{request_url}:{datetime.now(timezone.utc)}"
    return _make_onyx_request_id(prefix, hash_input)


def _make_onyx_request_id(prefix: str, hash_input: str) -> str:
    """helper function to return an id given a string input"""
    hash_obj = hashlib.md5(hash_input.encode("utf-8"))
    hash_bytes = hash_obj.digest()[:6]  # Truncate to 6 bytes

    # 6 bytes becomes 8 base64 characters. we shouldn't need to strip but just in case
    # NOTE: possible we'll want more input bytes if id's aren't unique enough
    hash_str = base64.urlsafe_b64encode(hash_bytes).decode("utf-8").rstrip("=")
    onyx_request_id = f"{prefix}:{hash_str}"
    return onyx_request_id
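A minimal usage sketch (illustrative, not part of these commits) of how the new middleware could be registered on a FastAPI app; the "API" prefix and the app/logger setup are assumptions for the example:

    from fastapi import FastAPI

    from onyx.utils.logger import setup_logger
    from onyx.utils.middleware import add_onyx_request_id_middleware

    app = FastAPI()
    logger = setup_logger()

    # Every request now gets an id of the form "API:<8 base64 chars>" stored in
    # ONYX_REQUEST_ID_CONTEXTVAR, which the logging adapter/filter prepends to log lines.
    add_onyx_request_id_middleware(app, "API", logger)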
|
||||
@@ -332,14 +332,15 @@ def wait_on_background(task: TimeoutThread[R]) -> R:
|
||||
return task.result
|
||||
|
||||
|
||||
def _next_or_none(ind: int, g: Iterator[R]) -> tuple[int, R | None]:
|
||||
return ind, next(g, None)
|
||||
def _next_or_none(ind: int, gen: Iterator[R]) -> tuple[int, R | None]:
|
||||
return ind, next(gen, None)
|
||||
|
||||
|
||||
def parallel_yield(gens: list[Iterator[R]], max_workers: int = 10) -> Iterator[R]:
|
||||
with ThreadPoolExecutor(max_workers=max_workers) as executor:
|
||||
future_to_index: dict[Future[tuple[int, R | None]], int] = {
|
||||
executor.submit(_next_or_none, i, g): i for i, g in enumerate(gens)
|
||||
executor.submit(_next_or_none, ind, gen): ind
|
||||
for ind, gen in enumerate(gens)
|
||||
}
|
||||
|
||||
next_ind = len(gens)
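A small consumption sketch (assumption: parallel_yield is exported from onyx.utils.threadpool_concurrency, the module this hunk appears to modify):

    from collections.abc import Iterator

    from onyx.utils.threadpool_concurrency import parallel_yield  # module path assumed

    def numbers(start: int) -> Iterator[int]:
        for i in range(start, start + 3):
            yield i

    # Values from both generators are yielded as soon as each worker produces them.
    for value in parallel_yield([numbers(0), numbers(100)]):
        print(value)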
|
||||
|
||||
@@ -95,4 +95,5 @@ urllib3==2.2.3
|
||||
mistune==0.8.4
|
||||
sentry-sdk==2.14.0
|
||||
prometheus_client==0.21.0
|
||||
fastapi-limiter==0.1.6
|
||||
fastapi-limiter==0.1.6
|
||||
prometheus_fastapi_instrumentator==7.1.0
|
||||
|
||||
@@ -15,4 +15,5 @@ uvicorn==0.21.1
|
||||
voyageai==0.2.3
|
||||
litellm==1.61.16
|
||||
sentry-sdk[fastapi,celery,starlette]==2.14.0
|
||||
aioboto3==13.4.0
|
||||
aioboto3==13.4.0
|
||||
prometheus_fastapi_instrumentator==7.1.0
|
||||
|
||||
@@ -58,6 +58,7 @@ INDEXING_ONLY = os.environ.get("INDEXING_ONLY", "").lower() == "true"
|
||||
|
||||
# The process needs to have this for the log file to write to
|
||||
# otherwise, it will not create additional log files
|
||||
# This should just be the filename base without extension or path.
|
||||
LOG_FILE_NAME = os.environ.get("LOG_FILE_NAME") or "onyx"
|
||||
|
||||
# Enable generating persistent log files for local dev environments
|
||||
|
||||
@@ -11,6 +11,15 @@ CURRENT_TENANT_ID_CONTEXTVAR: contextvars.ContextVar[
|
||||
"current_tenant_id", default=None if MULTI_TENANT else POSTGRES_DEFAULT_SCHEMA
|
||||
)
|
||||
|
||||
# set by every route in the API server
|
||||
INDEXING_REQUEST_ID_CONTEXTVAR: contextvars.ContextVar[
|
||||
str | None
|
||||
] = contextvars.ContextVar("indexing_request_id", default=None)
|
||||
|
||||
# set by every route in the API server
|
||||
ONYX_REQUEST_ID_CONTEXTVAR: contextvars.ContextVar[str | None] = contextvars.ContextVar(
|
||||
"onyx_request_id", default=None
|
||||
)
|
||||
|
||||
"""Utils related to contextvars"""
|
||||
|
||||
|
||||
@@ -34,7 +34,7 @@ def confluence_connector(space: str) -> ConfluenceConnector:
|
||||
return connector
|
||||
|
||||
|
||||
@pytest.mark.parametrize("space", [os.environ["CONFLUENCE_TEST_SPACE"]])
|
||||
@pytest.mark.parametrize("space", [os.getenv("CONFLUENCE_TEST_SPACE") or "DailyConne"])
|
||||
@patch(
|
||||
"onyx.file_processing.extract_file_text.get_unstructured_api_key",
|
||||
return_value=None,
|
||||
|
||||
backend/tests/daily/connectors/gong/test_gong.py (new file, 44 lines)
@@ -0,0 +1,44 @@
import os
import time
from unittest.mock import MagicMock
from unittest.mock import patch

import pytest

from onyx.connectors.gong.connector import GongConnector
from onyx.connectors.models import Document


@pytest.fixture
def gong_connector() -> GongConnector:
    connector = GongConnector()

    connector.load_credentials(
        {
            "gong_access_key": os.environ["GONG_ACCESS_KEY"],
            "gong_access_key_secret": os.environ["GONG_ACCESS_KEY_SECRET"],
        }
    )

    return connector


@patch(
    "onyx.file_processing.extract_file_text.get_unstructured_api_key",
    return_value=None,
)
def test_gong_basic(mock_get_api_key: MagicMock, gong_connector: GongConnector) -> None:
    doc_batch_generator = gong_connector.poll_source(0, time.time())

    doc_batch = next(doc_batch_generator)
    with pytest.raises(StopIteration):
        next(doc_batch_generator)

    assert len(doc_batch) == 2

    docs: list[Document] = []
    for doc in doc_batch:
        docs.append(doc)

    assert docs[0].semantic_identifier == "test with chris"
    assert docs[1].semantic_identifier == "Testing Gong"
|
||||
@@ -1,6 +1,7 @@
|
||||
import json
|
||||
import os
|
||||
import time
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from unittest.mock import MagicMock
|
||||
from unittest.mock import patch
|
||||
@@ -105,6 +106,54 @@ def test_highspot_connector_slim(
|
||||
assert len(all_slim_doc_ids) > 0
|
||||
|
||||
|
||||
@patch(
|
||||
"onyx.file_processing.extract_file_text.get_unstructured_api_key",
|
||||
return_value=None,
|
||||
)
|
||||
def test_highspot_connector_poll_source(
|
||||
mock_get_api_key: MagicMock, highspot_connector: HighspotConnector
|
||||
) -> None:
|
||||
"""Test poll_source functionality with date range filtering."""
|
||||
# Define date range: April 3, 2025 to April 4, 2025
|
||||
start_date = datetime(2025, 4, 3, 0, 0, 0)
|
||||
end_date = datetime(2025, 4, 4, 23, 59, 59)
|
||||
|
||||
# Convert to seconds since Unix epoch
|
||||
start_time = int(time.mktime(start_date.timetuple()))
|
||||
end_time = int(time.mktime(end_date.timetuple()))
|
||||
|
||||
# Load test data for assertions
|
||||
test_data = load_test_data()
|
||||
poll_source_data = test_data.get("poll_source", {})
|
||||
target_doc_id = poll_source_data.get("target_doc_id")
|
||||
|
||||
# Call poll_source with date range
|
||||
all_docs: list[Document] = []
|
||||
target_doc: Document | None = None
|
||||
|
||||
for doc_batch in highspot_connector.poll_source(start_time, end_time):
|
||||
for doc in doc_batch:
|
||||
all_docs.append(doc)
|
||||
if doc.id == f"HIGHSPOT_{target_doc_id}":
|
||||
target_doc = doc
|
||||
|
||||
# Verify documents were loaded
|
||||
assert len(all_docs) > 0
|
||||
|
||||
# Verify the specific test document was found and has correct properties
|
||||
assert target_doc is not None
|
||||
assert target_doc.semantic_identifier == poll_source_data.get("semantic_identifier")
|
||||
assert target_doc.source == DocumentSource.HIGHSPOT
|
||||
assert target_doc.metadata is not None
|
||||
|
||||
# Verify sections
|
||||
assert len(target_doc.sections) == 1
|
||||
section = target_doc.sections[0]
|
||||
assert section.link == poll_source_data.get("link")
|
||||
assert section.text is not None
|
||||
assert len(section.text) > 0
|
||||
|
||||
|
||||
def test_highspot_connector_validate_credentials(
|
||||
highspot_connector: HighspotConnector,
|
||||
) -> None:
|
||||
|
||||
@@ -1,5 +1,10 @@
|
||||
{
|
||||
"target_doc_id": "67cd8eb35d3ee0487de2e704",
|
||||
"semantic_identifier": "Highspot in Action _ Salesforce Integration",
|
||||
"link": "https://www.highspot.com/items/67cd8eb35d3ee0487de2e704"
|
||||
"link": "https://www.highspot.com/items/67cd8eb35d3ee0487de2e704",
|
||||
"poll_source": {
|
||||
"target_doc_id":"67ef9edcc3f40b2bf3d816a8",
|
||||
"semantic_identifier":"A Brief Introduction To AI",
|
||||
"link":"https://www.highspot.com/items/67ef9edcc3f40b2bf3d816a8"
|
||||
}
|
||||
}
|
||||
|
||||
@@ -35,23 +35,22 @@ def salesforce_connector() -> SalesforceConnector:
|
||||
connector = SalesforceConnector(
|
||||
requested_objects=["Account", "Contact", "Opportunity"],
|
||||
)
|
||||
|
||||
username = os.environ["SF_USERNAME"]
|
||||
password = os.environ["SF_PASSWORD"]
|
||||
security_token = os.environ["SF_SECURITY_TOKEN"]
|
||||
|
||||
connector.load_credentials(
|
||||
{
|
||||
"sf_username": os.environ["SF_USERNAME"],
|
||||
"sf_password": os.environ["SF_PASSWORD"],
|
||||
"sf_security_token": os.environ["SF_SECURITY_TOKEN"],
|
||||
"sf_username": username,
|
||||
"sf_password": password,
|
||||
"sf_security_token": security_token,
|
||||
}
|
||||
)
|
||||
return connector
|
||||
|
||||
|
||||
# TODO: make the credentials not expire
|
||||
@pytest.mark.xfail(
|
||||
reason=(
|
||||
"Credentials change over time, so this test will fail if run when "
|
||||
"the credentials expire."
|
||||
)
|
||||
)
|
||||
def test_salesforce_connector_basic(salesforce_connector: SalesforceConnector) -> None:
|
||||
test_data = load_test_data()
|
||||
target_test_doc: Document | None = None
|
||||
@@ -61,21 +60,26 @@ def test_salesforce_connector_basic(salesforce_connector: SalesforceConnector) -
|
||||
all_docs.append(doc)
|
||||
if doc.id == test_data["id"]:
|
||||
target_test_doc = doc
|
||||
break
|
||||
|
||||
# The number of docs here seems to change actively so do a very loose check
|
||||
# as of 2025-03-28 it was around 32472
|
||||
assert len(all_docs) > 32000
|
||||
assert len(all_docs) < 40000
|
||||
|
||||
assert len(all_docs) == 6
|
||||
assert target_test_doc is not None
|
||||
|
||||
# Set of received links
|
||||
received_links: set[str] = set()
|
||||
# List of received text fields, which contain key-value pairs separated by newlines
|
||||
recieved_text: list[str] = []
|
||||
received_text: list[str] = []
|
||||
|
||||
# Iterate over the sections of the target test doc to extract the links and text
|
||||
for section in target_test_doc.sections:
|
||||
assert section.link
|
||||
assert section.text
|
||||
received_links.add(section.link)
|
||||
recieved_text.append(section.text)
|
||||
received_text.append(section.text)
|
||||
|
||||
# Check that the received links match the expected links from the test data json
|
||||
expected_links = set(test_data["expected_links"])
|
||||
@@ -85,8 +89,9 @@ def test_salesforce_connector_basic(salesforce_connector: SalesforceConnector) -
|
||||
expected_text = test_data["expected_text"]
|
||||
if not isinstance(expected_text, list):
|
||||
raise ValueError("Expected text is not a list")
|
||||
|
||||
unparsed_expected_key_value_pairs: list[str] = expected_text
|
||||
received_key_value_pairs = extract_key_value_pairs_to_set(recieved_text)
|
||||
received_key_value_pairs = extract_key_value_pairs_to_set(received_text)
|
||||
expected_key_value_pairs = extract_key_value_pairs_to_set(
|
||||
unparsed_expected_key_value_pairs
|
||||
)
|
||||
@@ -96,13 +101,21 @@ def test_salesforce_connector_basic(salesforce_connector: SalesforceConnector) -
|
||||
assert target_test_doc.source == DocumentSource.SALESFORCE
|
||||
assert target_test_doc.semantic_identifier == test_data["semantic_identifier"]
|
||||
assert target_test_doc.metadata == test_data["metadata"]
|
||||
assert target_test_doc.primary_owners == test_data["primary_owners"]
|
||||
|
||||
assert target_test_doc.primary_owners is not None
|
||||
primary_owner = target_test_doc.primary_owners[0]
|
||||
expected_primary_owner = test_data["primary_owners"]
|
||||
assert isinstance(expected_primary_owner, dict)
|
||||
assert primary_owner.email == expected_primary_owner["email"]
|
||||
assert primary_owner.first_name == expected_primary_owner["first_name"]
|
||||
assert primary_owner.last_name == expected_primary_owner["last_name"]
|
||||
|
||||
assert target_test_doc.secondary_owners == test_data["secondary_owners"]
|
||||
assert target_test_doc.title == test_data["title"]
|
||||
|
||||
|
||||
# TODO: make the credentials not expire
|
||||
@pytest.mark.xfail(
|
||||
@pytest.mark.skip(
|
||||
reason=(
|
||||
"Credentials change over time, so this test will fail if run when "
|
||||
"the credentials expire."
|
||||
|
||||
@@ -1,20 +1,162 @@
|
||||
{
|
||||
"id": "SALESFORCE_001fI000005drUcQAI",
|
||||
"id": "SALESFORCE_001bm00000eu6n5AAA",
|
||||
"expected_links": [
|
||||
"https://customization-ruby-2195.my.salesforce.com/001fI000005drUcQAI",
|
||||
"https://customization-ruby-2195.my.salesforce.com/003fI000001jiCPQAY",
|
||||
"https://customization-ruby-2195.my.salesforce.com/017fI00000T7hvsQAB",
|
||||
"https://customization-ruby-2195.my.salesforce.com/006fI000000rDvBQAU"
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESpEeAAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESqd3AAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESoKiAAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESvDSAA1",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrmHAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrl2AAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESvejAAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000EStlvAAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESpPfAAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrP9AAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESvlMAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESt3JAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESoBkAAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000EStw2AAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrkMAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESojKAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuLEAA1",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESoSIAA1",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESu2YAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESvgSAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESurnAAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrnqAAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESoB5AAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuJuAAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrfyAAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/001bm00000eu6n5AAA",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESpUHAA1",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESsgGAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESr7UAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESu1BAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESpqzAAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESplZAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESvJ3AAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESurKAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000EStSiAAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuJFAA1",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESu8xAAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESqfzAAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESqsrAAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000EStoZAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESsIUAA1",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESsAGAA1",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESv8GAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrOKAA1",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESoUmAAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESudKAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuJ8AAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESvf2AAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESw3qAAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESugRAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESr18AAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESqV1AAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuLVAA1",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESpjoAAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESqULAA1",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuCAAA1",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrfpAAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESp5YAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrMNAA1",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000EStaUAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESt5LAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrtcAAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESomaAAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrtIAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESoToAAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuWLAA1",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESrWvAAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESsJEAA1",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESsxwAAD",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESvUgAAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESvWjAAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000EStBuAAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESpZiAAL",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuhYAAT",
|
||||
"https://danswer-dev-ed.develop.my.salesforce.com/003bm00000ESuWAAA1"
|
||||
],
|
||||
"expected_text": [
|
||||
"BillingPostalCode: 60601\nType: Prospect\nWebsite: www.globalistindustries.com\nBillingCity: Chicago\nDescription: Globalist company\nIsDeleted: false\nIsPartner: false\nPhone: (312) 555-0456\nShippingCountry: USA\nShippingState: IL\nIsBuyer: false\nBillingCountry: USA\nBillingState: IL\nShippingPostalCode: 60601\nBillingStreet: 456 Market St\nIsCustomerPortal: false\nPersonActiveTrackerCount: 0\nShippingCity: Chicago\nShippingStreet: 456 Market St",
|
||||
"FirstName: Michael\nMailingCountry: USA\nActiveTrackerCount: 0\nEmail: m.brown@globalindustries.com\nMailingState: IL\nMailingStreet: 456 Market St\nMailingCity: Chicago\nLastName: Brown\nTitle: CTO\nIsDeleted: false\nPhone: (312) 555-0456\nHasOptedOutOfEmail: false\nIsEmailBounced: false\nMailingPostalCode: 60601",
|
||||
"ForecastCategory: Closed\nName: Global Industries Equipment Sale\nIsDeleted: false\nForecastCategoryName: Closed\nFiscalYear: 2024\nFiscalQuarter: 4\nIsClosed: true\nIsWon: true\nAmount: 5000000.0\nProbability: 100.0\nPushCount: 0\nHasOverdueTask: false\nStageName: Closed Won\nHasOpenActivity: false\nHasOpportunityLineItem: false",
|
||||
"Field: created\nDataType: Text\nIsDeleted: false"
|
||||
"IsDeleted: false\nBillingCity: Shaykh al \u00e1\u00b8\u00a8ad\u00c4\u00abd\nName: Voonder\nCleanStatus: Pending\nBillingStreet: 12 Cambridge Parkway",
|
||||
"Email: eslayqzs@icio.us\nIsDeleted: false\nLastName: Slay\nIsEmailBounced: false\nFirstName: Ebeneser\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: ptweedgdh@umich.edu\nIsDeleted: false\nLastName: Tweed\nIsEmailBounced: false\nFirstName: Paulita\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: ehurnellnlx@facebook.com\nIsDeleted: false\nLastName: Hurnell\nIsEmailBounced: false\nFirstName: Eliot\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: ccarik4q4@google.it\nIsDeleted: false\nLastName: Carik\nIsEmailBounced: false\nFirstName: Chadwick\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: cvannozziina6@moonfruit.com\nIsDeleted: false\nLastName: Vannozzii\nIsEmailBounced: false\nFirstName: Christophorus\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: mikringill2kz@hugedomains.com\nIsDeleted: false\nLastName: Ikringill\nIsEmailBounced: false\nFirstName: Meghann\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: bgrinvalray@fda.gov\nIsDeleted: false\nLastName: Grinval\nIsEmailBounced: false\nFirstName: Berti\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: aollanderhr7@cam.ac.uk\nIsDeleted: false\nLastName: Ollander\nIsEmailBounced: false\nFirstName: Annemarie\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: rwhitesideq38@gravatar.com\nIsDeleted: false\nLastName: Whiteside\nIsEmailBounced: false\nFirstName: Rolando\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: vkrafthmz@techcrunch.com\nIsDeleted: false\nLastName: Kraft\nIsEmailBounced: false\nFirstName: Vidovik\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: jhillaut@4shared.com\nIsDeleted: false\nLastName: Hill\nIsEmailBounced: false\nFirstName: Janel\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: lralstonycs@discovery.com\nIsDeleted: false\nLastName: Ralston\nIsEmailBounced: false\nFirstName: Lorrayne\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: blyttlewba@networkadvertising.org\nIsDeleted: false\nLastName: Lyttle\nIsEmailBounced: false\nFirstName: Ban\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: pplummernvf@technorati.com\nIsDeleted: false\nLastName: Plummer\nIsEmailBounced: false\nFirstName: Pete\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: babrahamoffxpb@theatlantic.com\nIsDeleted: false\nLastName: Abrahamoff\nIsEmailBounced: false\nFirstName: Brander\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: ahargieym0@homestead.com\nIsDeleted: false\nLastName: Hargie\nIsEmailBounced: false\nFirstName: Aili\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: hstotthp2@yelp.com\nIsDeleted: false\nLastName: Stott\nIsEmailBounced: false\nFirstName: Hartley\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: jganniclifftuvj@blinklist.com\nIsDeleted: false\nLastName: Ganniclifft\nIsEmailBounced: false\nFirstName: Jamima\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: ldodelly8q@ed.gov\nIsDeleted: false\nLastName: Dodell\nIsEmailBounced: false\nFirstName: Lynde\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: rmilner3cp@smh.com.au\nIsDeleted: false\nLastName: Milner\nIsEmailBounced: false\nFirstName: Ralph\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: gghiriardellic19@state.tx.us\nIsDeleted: false\nLastName: Ghiriardelli\nIsEmailBounced: false\nFirstName: Garv\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: rhubatschfpu@nature.com\nIsDeleted: false\nLastName: Hubatsch\nIsEmailBounced: false\nFirstName: Rose\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: mtrenholme1ws@quantcast.com\nIsDeleted: false\nLastName: Trenholme\nIsEmailBounced: false\nFirstName: Mariejeanne\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: jmussettpbd@over-blog.com\nIsDeleted: false\nLastName: Mussett\nIsEmailBounced: false\nFirstName: Juliann\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: bgoroni145@illinois.edu\nIsDeleted: false\nLastName: Goroni\nIsEmailBounced: false\nFirstName: Bernarr\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: afalls3ph@theguardian.com\nIsDeleted: false\nLastName: Falls\nIsEmailBounced: false\nFirstName: Angelia\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: lswettjoi@go.com\nIsDeleted: false\nLastName: Swett\nIsEmailBounced: false\nFirstName: Levon\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: emullinsz38@dailymotion.com\nIsDeleted: false\nLastName: Mullins\nIsEmailBounced: false\nFirstName: Elsa\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: ibernettehco@ebay.co.uk\nIsDeleted: false\nLastName: Bernette\nIsEmailBounced: false\nFirstName: Ingrid\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: trisleybtt@simplemachines.org\nIsDeleted: false\nLastName: Risley\nIsEmailBounced: false\nFirstName: Toma\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: rgypsonqx1@goodreads.com\nIsDeleted: false\nLastName: Gypson\nIsEmailBounced: false\nFirstName: Reed\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: cposvneri28@jiathis.com\nIsDeleted: false\nLastName: Posvner\nIsEmailBounced: false\nFirstName: Culley\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: awilmut2rz@geocities.jp\nIsDeleted: false\nLastName: Wilmut\nIsEmailBounced: false\nFirstName: Andy\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: aluckwellra5@exblog.jp\nIsDeleted: false\nLastName: Luckwell\nIsEmailBounced: false\nFirstName: Andreana\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: irollings26j@timesonline.co.uk\nIsDeleted: false\nLastName: Rollings\nIsEmailBounced: false\nFirstName: Ibrahim\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: gspireqpd@g.co\nIsDeleted: false\nLastName: Spire\nIsEmailBounced: false\nFirstName: Gaelan\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: sbezleyk2y@acquirethisname.com\nIsDeleted: false\nLastName: Bezley\nIsEmailBounced: false\nFirstName: Sindee\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: icollerrr@flickr.com\nIsDeleted: false\nLastName: Coller\nIsEmailBounced: false\nFirstName: Inesita\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: kfolliott1bo@nature.com\nIsDeleted: false\nLastName: Folliott\nIsEmailBounced: false\nFirstName: Kennan\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: kroofjfo@gnu.org\nIsDeleted: false\nLastName: Roof\nIsEmailBounced: false\nFirstName: Karlik\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: lcovotti8s4@rediff.com\nIsDeleted: false\nLastName: Covotti\nIsEmailBounced: false\nFirstName: Lucho\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: gpatriskson1rs@census.gov\nIsDeleted: false\nLastName: Patriskson\nIsEmailBounced: false\nFirstName: Gardener\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: spidgleyqvw@usgs.gov\nIsDeleted: false\nLastName: Pidgley\nIsEmailBounced: false\nFirstName: Simona\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: cbecarrak0i@over-blog.com\nIsDeleted: false\nLastName: Becarra\nIsEmailBounced: false\nFirstName: Cally\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: aparkman9td@bbc.co.uk\nIsDeleted: false\nLastName: Parkman\nIsEmailBounced: false\nFirstName: Agneta\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: bboddingtonhn@quantcast.com\nIsDeleted: false\nLastName: Boddington\nIsEmailBounced: false\nFirstName: Betta\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: dcasementx0p@cafepress.com\nIsDeleted: false\nLastName: Casement\nIsEmailBounced: false\nFirstName: Dannie\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: hzornbhe@latimes.com\nIsDeleted: false\nLastName: Zorn\nIsEmailBounced: false\nFirstName: Haleigh\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: cfifieldbjb@blogspot.com\nIsDeleted: false\nLastName: Fifield\nIsEmailBounced: false\nFirstName: Christalle\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: ddewerson4t3@skype.com\nIsDeleted: false\nLastName: Dewerson\nIsEmailBounced: false\nFirstName: Dyann\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: khullock52p@sohu.com\nIsDeleted: false\nLastName: Hullock\nIsEmailBounced: false\nFirstName: Kellina\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: tfremantle32n@bandcamp.com\nIsDeleted: false\nLastName: Fremantle\nIsEmailBounced: false\nFirstName: Turner\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: sbernardtylp@nps.gov\nIsDeleted: false\nLastName: Bernardt\nIsEmailBounced: false\nFirstName: Selina\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: smcgettigan8kk@slideshare.net\nIsDeleted: false\nLastName: McGettigan\nIsEmailBounced: false\nFirstName: Sada\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: wdelafontvgn@businesswire.com\nIsDeleted: false\nLastName: Delafont\nIsEmailBounced: false\nFirstName: West\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: lbelsher9ne@indiatimes.com\nIsDeleted: false\nLastName: Belsher\nIsEmailBounced: false\nFirstName: Lou\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: cgoody27y@blogtalkradio.com\nIsDeleted: false\nLastName: Goody\nIsEmailBounced: false\nFirstName: Colene\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: cstodejzz@ucoz.ru\nIsDeleted: false\nLastName: Stode\nIsEmailBounced: false\nFirstName: Curcio\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: abromidgejb@china.com.cn\nIsDeleted: false\nLastName: Bromidge\nIsEmailBounced: false\nFirstName: Ariela\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: ldelgardilloqvp@xrea.com\nIsDeleted: false\nLastName: Delgardillo\nIsEmailBounced: false\nFirstName: Lauralee\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: dcroal9t4@businessinsider.com\nIsDeleted: false\nLastName: Croal\nIsEmailBounced: false\nFirstName: Devlin\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: dclarageqzb@wordpress.com\nIsDeleted: false\nLastName: Clarage\nIsEmailBounced: false\nFirstName: Dre\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: dthirlwall3jf@taobao.com\nIsDeleted: false\nLastName: Thirlwall\nIsEmailBounced: false\nFirstName: Dareen\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: tkeddie2lj@wiley.com\nIsDeleted: false\nLastName: Keddie\nIsEmailBounced: false\nFirstName: Tandi\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: jrimingtoni3i@istockphoto.com\nIsDeleted: false\nLastName: Rimington\nIsEmailBounced: false\nFirstName: Judy\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: gtroynet@slashdot.org\nIsDeleted: false\nLastName: Troy\nIsEmailBounced: false\nFirstName: Gail\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: ebunneyh0n@meetup.com\nIsDeleted: false\nLastName: Bunney\nIsEmailBounced: false\nFirstName: Efren\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: yhaken8p3@slate.com\nIsDeleted: false\nLastName: Haken\nIsEmailBounced: false\nFirstName: Yard\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: nolliffeq6q@biblegateway.com\nIsDeleted: false\nLastName: Olliffe\nIsEmailBounced: false\nFirstName: Nani\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: bgalia9jz@odnoklassniki.ru\nIsDeleted: false\nLastName: Galia\nIsEmailBounced: false\nFirstName: Berrie\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: djedrzej3v1@google.com\nIsDeleted: false\nLastName: Jedrzej\nIsEmailBounced: false\nFirstName: Deanne\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: mcamiesh1t@fc2.com\nIsDeleted: false\nLastName: Camies\nIsEmailBounced: false\nFirstName: Mikaela\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: csunshineqni@state.tx.us\nIsDeleted: false\nLastName: Sunshine\nIsEmailBounced: false\nFirstName: Curtis\nIsPriorityRecord: false\nCleanStatus: Pending",
|
||||
"Email: fiannellib46@marriott.com\nIsDeleted: false\nLastName: Iannelli\nIsEmailBounced: false\nFirstName: Felicio\nIsPriorityRecord: false\nCleanStatus: Pending"
|
||||
],
|
||||
"semantic_identifier": "Unknown Object",
|
||||
"semantic_identifier": "Voonder",
|
||||
"metadata": {},
|
||||
"primary_owners": null,
|
||||
"primary_owners": {"email": "hagen@danswer.ai", "first_name": "Hagen", "last_name": "oneill"},
|
||||
"secondary_owners": null,
|
||||
"title": null
|
||||
}
|
||||
|
||||
@@ -444,6 +444,7 @@ class CCPairManager:
|
||||
)
|
||||
if group_sync_result.status_code != 409:
|
||||
group_sync_result.raise_for_status()
|
||||
time.sleep(2)
|
||||
|
||||
@staticmethod
|
||||
def get_doc_sync_task(
|
||||
|
||||
@@ -165,17 +165,18 @@ class DocumentManager:
|
||||
doc["fields"]["document_id"]: doc["fields"] for doc in retrieved_docs_dict
|
||||
}
|
||||
|
||||
# NOTE(rkuo): too much log spam
|
||||
# Left this here for debugging purposes.
|
||||
import json
|
||||
# import json
|
||||
|
||||
print("DEBUGGING DOCUMENTS")
|
||||
print(retrieved_docs)
|
||||
for doc in retrieved_docs.values():
|
||||
printable_doc = doc.copy()
|
||||
print(printable_doc.keys())
|
||||
printable_doc.pop("embeddings")
|
||||
printable_doc.pop("title_embedding")
|
||||
print(json.dumps(printable_doc, indent=2))
|
||||
# print("DEBUGGING DOCUMENTS")
|
||||
# print(retrieved_docs)
|
||||
# for doc in retrieved_docs.values():
|
||||
# printable_doc = doc.copy()
|
||||
# print(printable_doc.keys())
|
||||
# printable_doc.pop("embeddings")
|
||||
# printable_doc.pop("title_embedding")
|
||||
# print(json.dumps(printable_doc, indent=2))
|
||||
|
||||
for document in cc_pair.documents:
|
||||
retrieved_doc = retrieved_docs.get(document.id)
|
||||
|
||||
@@ -1,3 +1,4 @@
import time
from datetime import datetime
from datetime import timedelta
from urllib.parse import urlencode
@@ -191,7 +192,7 @@ class IndexAttemptManager:
user_performing_action: DATestUser | None = None,
) -> None:
"""Wait for an IndexAttempt to complete"""
start = datetime.now()
start = time.monotonic()
while True:
index_attempt = IndexAttemptManager.get_index_attempt_by_id(
index_attempt_id=index_attempt_id,
@@ -203,7 +204,7 @@ class IndexAttemptManager:
print(f"IndexAttempt {index_attempt_id} completed")
return

elapsed = (datetime.now() - start).total_seconds()
elapsed = time.monotonic() - start
if elapsed > timeout:
raise TimeoutError(
f"IndexAttempt {index_attempt_id} did not complete within {timeout} seconds"

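Note on the IndexAttemptManager hunk above: it swaps the wall-clock datetime.now() for time.monotonic() when timing the polling loop, so the timeout math cannot be skewed by NTP or manual clock adjustments. A minimal standalone sketch of the same pattern follows; wait_until, check_complete, and poll_interval are hypothetical names, not part of the diff.

import time


def wait_until(check_complete, timeout: float, poll_interval: float = 1.0) -> None:
    # time.monotonic() never jumps with the system clock, unlike datetime.now(),
    # so the measured elapsed time stays trustworthy.
    start = time.monotonic()
    while True:
        if check_complete():
            return
        if time.monotonic() - start > timeout:
            raise TimeoutError(f"did not complete within {timeout} seconds")
        time.sleep(poll_interval)
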
@@ -4,7 +4,7 @@ from uuid import uuid4
import requests

from onyx.context.search.enums import RecencyBiasSetting
from onyx.server.features.persona.models import PersonaSnapshot
from onyx.server.features.persona.models import FullPersonaSnapshot
from onyx.server.features.persona.models import PersonaUpsertRequest
from tests.integration.common_utils.constants import API_SERVER_URL
from tests.integration.common_utils.constants import GENERAL_HEADERS
@@ -181,7 +181,7 @@ class PersonaManager:
@staticmethod
def get_all(
user_performing_action: DATestUser | None = None,
) -> list[PersonaSnapshot]:
) -> list[FullPersonaSnapshot]:
response = requests.get(
f"{API_SERVER_URL}/admin/persona",
headers=user_performing_action.headers
@@ -189,13 +189,13 @@ class PersonaManager:
else GENERAL_HEADERS,
)
response.raise_for_status()
return [PersonaSnapshot(**persona) for persona in response.json()]
return [FullPersonaSnapshot(**persona) for persona in response.json()]

@staticmethod
def get_one(
persona_id: int,
user_performing_action: DATestUser | None = None,
) -> list[PersonaSnapshot]:
) -> list[FullPersonaSnapshot]:
response = requests.get(
f"{API_SERVER_URL}/persona/{persona_id}",
headers=user_performing_action.headers
@@ -203,7 +203,7 @@ class PersonaManager:
else GENERAL_HEADERS,
)
response.raise_for_status()
return [PersonaSnapshot(**response.json())]
return [FullPersonaSnapshot(**response.json())]

@staticmethod
def verify(

@@ -313,29 +313,3 @@ class UserManager:
)
response.raise_for_status()
return UserInfo(**response.json())

@staticmethod
def invite_users(
user_performing_action: DATestUser,
emails: list[str],
) -> int:
response = requests.put(
url=f"{API_SERVER_URL}/manage/admin/users",
json={"emails": emails},
headers=user_performing_action.headers,
)
response.raise_for_status()
return response.json()

@staticmethod
def remove_invited_user(
user_performing_action: DATestUser,
user_email: str,
) -> int:
response = requests.patch(
url=f"{API_SERVER_URL}/manage/admin/remove-invited-user",
json={"user_email": user_email},
headers=user_performing_action.headers,
)
response.raise_for_status()
return response.json()

@@ -22,7 +22,6 @@ from onyx.document_index.document_index_utils import get_multipass_config
from onyx.document_index.vespa.index import DOCUMENT_ID_ENDPOINT
from onyx.document_index.vespa.index import VespaIndex
from onyx.indexing.models import IndexingSetting
from onyx.redis.redis_pool import get_redis_client
from onyx.setup import setup_postgres
from onyx.setup import setup_vespa
from onyx.utils.logger import setup_logger
@@ -238,12 +237,6 @@ def reset_vespa() -> None:
time.sleep(5)


def reset_redis() -> None:
"""Reset the Redis database."""
redis_client = get_redis_client()
redis_client.flushall()


def reset_postgres_multitenant() -> None:
"""Reset the Postgres database for all tenants in a multitenant setup."""

@@ -348,8 +341,6 @@ def reset_all() -> None:
reset_postgres()
logger.info("Resetting Vespa...")
reset_vespa()
logger.info("Resetting Redis...")
reset_redis()


def reset_all_multitenant() -> None:

@@ -14,9 +14,8 @@ from tests.integration.connector_job_tests.slack.slack_api_utils import SlackMan
@pytest.fixture()
def slack_test_setup() -> Generator[tuple[dict[str, Any], dict[str, Any]], None, None]:
slack_client = SlackManager.get_slack_client(os.environ["SLACK_BOT_TOKEN"])
admin_user_id = SlackManager.build_slack_user_email_id_map(slack_client)[
"admin@onyx-test.com"
]
user_map = SlackManager.build_slack_user_email_id_map(slack_client)
admin_user_id = user_map["admin@onyx-test.com"]

(
public_channel,

@@ -3,8 +3,6 @@ from datetime import datetime
from datetime import timezone
from typing import Any

import pytest

from onyx.connectors.models import InputType
from onyx.db.enums import AccessType
from onyx.server.documents.models import DocumentSource
@@ -25,7 +23,6 @@ from tests.integration.common_utils.vespa import vespa_fixture
from tests.integration.connector_job_tests.slack.slack_api_utils import SlackManager


@pytest.mark.xfail(reason="flaky - see DAN-789 for example", strict=False)
def test_slack_permission_sync(
reset: None,
vespa_client: vespa_fixture,
@@ -221,7 +218,6 @@ def test_slack_permission_sync(
assert private_message not in onyx_doc_message_strings


@pytest.mark.xfail(reason="flaky", strict=False)
def test_slack_group_permission_sync(
reset: None,
vespa_client: vespa_fixture,

@@ -1,38 +0,0 @@
import pytest
from requests import HTTPError

from onyx.auth.schemas import UserRole
from tests.integration.common_utils.managers.user import UserManager
from tests.integration.common_utils.test_models import DATestUser


def test_inviting_users_flow(reset: None) -> None:
"""
Test that verifies the functionality around inviting users:
1. Creating an admin user
2. Admin inviting a new user
3. Invited user successfully signing in
4. Non-invited user attempting to sign in (should result in an error)
"""
# 1) Create an admin user (the first user created is automatically admin)
admin_user: DATestUser = UserManager.create(name="admin_user")
assert admin_user is not None
assert UserManager.is_role(admin_user, UserRole.ADMIN)

# 2) Admin invites a new user
invited_email = "invited_user@test.com"
invite_response = UserManager.invite_users(admin_user, [invited_email])

assert invite_response == 1

# 3) The invited user successfully registers/logs in
invited_user: DATestUser = UserManager.create(
name="invited_user", email=invited_email
)
assert invited_user is not None
assert invited_user.email == invited_email
assert UserManager.is_role(invited_user, UserRole.BASIC)

# 4) A non-invited user attempts to sign in/register (should fail)
with pytest.raises(HTTPError):
UserManager.create(name="uninvited_user", email="uninvited_user@test.com")
@@ -5,8 +5,11 @@ from ee.onyx.external_permissions.salesforce.postprocessing import (
)
from onyx.configs.app_configs import BLURB_SIZE
from onyx.configs.constants import DocumentSource
from onyx.connectors.salesforce.utils import BASE_DATA_PATH
from onyx.context.search.models import InferenceChunk

SQLITE_DIR = BASE_DATA_PATH


def create_test_chunk(
doc_id: str,
@@ -39,6 +42,7 @@ def create_test_chunk(

def test_validate_salesforce_access_single_object() -> None:
"""Test filtering when chunk has a single Salesforce object reference"""

section = "This is a test document about a Salesforce object."
test_content = section
test_chunk = create_test_chunk(

@@ -4,6 +4,7 @@ from onyx.chat.prune_and_merge import _merge_sections
|
||||
from onyx.configs.constants import DocumentSource
|
||||
from onyx.context.search.models import InferenceChunk
|
||||
from onyx.context.search.models import InferenceSection
|
||||
from onyx.context.search.utils import inference_section_from_chunks
|
||||
|
||||
|
||||
# This large test accounts for all of the following:
|
||||
@@ -111,7 +112,7 @@ Content 17
|
||||
# Sections
|
||||
[
|
||||
# Document 1, top/middle/bot connected + disconnected section
|
||||
InferenceSection(
|
||||
inference_section_from_chunks(
|
||||
center_chunk=DOC_1_TOP_CHUNK,
|
||||
chunks=[
|
||||
DOC_1_FILLER_1,
|
||||
@@ -120,9 +121,8 @@ Content 17
|
||||
DOC_1_MID_CHUNK,
|
||||
DOC_1_FILLER_3,
|
||||
],
|
||||
combined_content="N/A", # Not used
|
||||
),
|
||||
InferenceSection(
|
||||
inference_section_from_chunks(
|
||||
center_chunk=DOC_1_MID_CHUNK,
|
||||
chunks=[
|
||||
DOC_1_FILLER_2,
|
||||
@@ -131,9 +131,8 @@ Content 17
|
||||
DOC_1_FILLER_3,
|
||||
DOC_1_FILLER_4,
|
||||
],
|
||||
combined_content="N/A",
|
||||
),
|
||||
InferenceSection(
|
||||
inference_section_from_chunks(
|
||||
center_chunk=DOC_1_BOTTOM_CHUNK,
|
||||
chunks=[
|
||||
DOC_1_FILLER_3,
|
||||
@@ -142,9 +141,8 @@ Content 17
|
||||
DOC_1_FILLER_5,
|
||||
DOC_1_FILLER_6,
|
||||
],
|
||||
combined_content="N/A",
|
||||
),
|
||||
InferenceSection(
|
||||
inference_section_from_chunks(
|
||||
center_chunk=DOC_1_DISCONNECTED,
|
||||
chunks=[
|
||||
DOC_1_FILLER_7,
|
||||
@@ -153,9 +151,8 @@ Content 17
|
||||
DOC_1_FILLER_9,
|
||||
DOC_1_FILLER_10,
|
||||
],
|
||||
combined_content="N/A",
|
||||
),
|
||||
InferenceSection(
|
||||
inference_section_from_chunks(
|
||||
center_chunk=DOC_2_TOP_CHUNK,
|
||||
chunks=[
|
||||
DOC_2_FILLER_1,
|
||||
@@ -164,9 +161,8 @@ Content 17
|
||||
DOC_2_FILLER_3,
|
||||
DOC_2_BOTTOM_CHUNK,
|
||||
],
|
||||
combined_content="N/A",
|
||||
),
|
||||
InferenceSection(
|
||||
inference_section_from_chunks(
|
||||
center_chunk=DOC_2_BOTTOM_CHUNK,
|
||||
chunks=[
|
||||
DOC_2_TOP_CHUNK,
|
||||
@@ -175,7 +171,6 @@ Content 17
|
||||
DOC_2_FILLER_4,
|
||||
DOC_2_FILLER_5,
|
||||
],
|
||||
combined_content="N/A",
|
||||
),
|
||||
],
|
||||
# Expected Content
|
||||
@@ -204,15 +199,13 @@ def test_merge_sections(
|
||||
(
|
||||
# Sections
|
||||
[
|
||||
InferenceSection(
|
||||
inference_section_from_chunks(
|
||||
center_chunk=DOC_1_TOP_CHUNK,
|
||||
chunks=[DOC_1_TOP_CHUNK],
|
||||
combined_content="N/A", # Not used
|
||||
),
|
||||
InferenceSection(
|
||||
inference_section_from_chunks(
|
||||
center_chunk=DOC_1_MID_CHUNK,
|
||||
chunks=[DOC_1_MID_CHUNK],
|
||||
combined_content="N/A",
|
||||
),
|
||||
],
|
||||
# Expected Content
|
||||
|
||||
@@ -113,15 +113,18 @@ _VALID_SALESFORCE_IDS = [
|
||||
]
|
||||
|
||||
|
||||
def _clear_sf_db() -> None:
|
||||
def _clear_sf_db(directory: str) -> None:
|
||||
"""
|
||||
Clears the SF DB by deleting all files in the data directory.
|
||||
"""
|
||||
shutil.rmtree(BASE_DATA_PATH, ignore_errors=True)
|
||||
shutil.rmtree(directory, ignore_errors=True)
|
||||
|
||||
|
||||
def _create_csv_file(
|
||||
object_type: str, records: list[dict], filename: str = "test_data.csv"
|
||||
directory: str,
|
||||
object_type: str,
|
||||
records: list[dict],
|
||||
filename: str = "test_data.csv",
|
||||
) -> None:
|
||||
"""
|
||||
Creates a CSV file for the given object type and records.
|
||||
@@ -149,10 +152,10 @@ def _create_csv_file(
|
||||
writer.writerow(record)
|
||||
|
||||
# Update the database with the CSV
|
||||
update_sf_db_with_csv(object_type, csv_path)
|
||||
update_sf_db_with_csv(directory, object_type, csv_path)
|
||||
|
||||
|
||||
def _create_csv_with_example_data() -> None:
|
||||
def _create_csv_with_example_data(directory: str) -> None:
|
||||
"""
|
||||
Creates CSV files with example data, organized by object type.
|
||||
"""
|
||||
@@ -342,10 +345,10 @@ def _create_csv_with_example_data() -> None:
|
||||
|
||||
# Create CSV files for each object type
|
||||
for object_type, records in example_data.items():
|
||||
_create_csv_file(object_type, records)
|
||||
_create_csv_file(directory, object_type, records)
|
||||
|
||||
|
||||
def _test_query() -> None:
|
||||
def _test_query(directory: str) -> None:
|
||||
"""
|
||||
Tests querying functionality by verifying:
|
||||
1. All expected Account IDs are found
|
||||
@@ -401,7 +404,7 @@ def _test_query() -> None:
|
||||
}
|
||||
|
||||
# Get all Account IDs
|
||||
account_ids = find_ids_by_type("Account")
|
||||
account_ids = find_ids_by_type(directory, "Account")
|
||||
|
||||
# Verify we found all expected accounts
|
||||
assert len(account_ids) == len(
|
||||
@@ -413,7 +416,7 @@ def _test_query() -> None:
|
||||
|
||||
# Verify each account's data
|
||||
for acc_id in account_ids:
|
||||
combined = get_record(acc_id)
|
||||
combined = get_record(directory, acc_id)
|
||||
assert combined is not None, f"Could not find account {acc_id}"
|
||||
|
||||
expected = expected_accounts[acc_id]
|
||||
@@ -428,7 +431,7 @@ def _test_query() -> None:
|
||||
print("All query tests passed successfully!")
|
||||
|
||||
|
||||
def _test_upsert() -> None:
|
||||
def _test_upsert(directory: str) -> None:
|
||||
"""
|
||||
Tests upsert functionality by:
|
||||
1. Updating an existing account
|
||||
@@ -453,10 +456,10 @@ def _test_upsert() -> None:
|
||||
},
|
||||
]
|
||||
|
||||
_create_csv_file("Account", update_data, "update_data.csv")
|
||||
_create_csv_file(directory, "Account", update_data, "update_data.csv")
|
||||
|
||||
# Verify the update worked
|
||||
updated_record = get_record(_VALID_SALESFORCE_IDS[0])
|
||||
updated_record = get_record(directory, _VALID_SALESFORCE_IDS[0])
|
||||
assert updated_record is not None, "Updated record not found"
|
||||
assert updated_record.data["Name"] == "Acme Inc. Updated", "Name not updated"
|
||||
assert (
|
||||
@@ -464,7 +467,7 @@ def _test_upsert() -> None:
|
||||
), "Description not added"
|
||||
|
||||
# Verify the new record was created
|
||||
new_record = get_record(_VALID_SALESFORCE_IDS[2])
|
||||
new_record = get_record(directory, _VALID_SALESFORCE_IDS[2])
|
||||
assert new_record is not None, "New record not found"
|
||||
assert new_record.data["Name"] == "New Company Inc.", "New record name incorrect"
|
||||
assert new_record.data["AnnualRevenue"] == "1000000", "New record revenue incorrect"
|
||||
@@ -472,7 +475,7 @@ def _test_upsert() -> None:
|
||||
print("All upsert tests passed successfully!")
|
||||
|
||||
|
||||
def _test_relationships() -> None:
|
||||
def _test_relationships(directory: str) -> None:
|
||||
"""
|
||||
Tests relationship shelf updates and queries by:
|
||||
1. Creating test data with relationships
|
||||
@@ -513,11 +516,11 @@ def _test_relationships() -> None:
|
||||
|
||||
# Create and update CSV files for each object type
|
||||
for object_type, records in test_data.items():
|
||||
_create_csv_file(object_type, records, "relationship_test.csv")
|
||||
_create_csv_file(directory, object_type, records, "relationship_test.csv")
|
||||
|
||||
# Test relationship queries
|
||||
# All these objects should be children of Acme Inc.
|
||||
child_ids = get_child_ids(_VALID_SALESFORCE_IDS[0])
|
||||
child_ids = get_child_ids(directory, _VALID_SALESFORCE_IDS[0])
|
||||
assert len(child_ids) == 4, f"Expected 4 child objects, found {len(child_ids)}"
|
||||
assert _VALID_SALESFORCE_IDS[13] in child_ids, "Case 1 not found in relationship"
|
||||
assert _VALID_SALESFORCE_IDS[14] in child_ids, "Case 2 not found in relationship"
|
||||
@@ -527,7 +530,7 @@ def _test_relationships() -> None:
|
||||
), "Opportunity not found in relationship"
|
||||
|
||||
# Test querying relationships for a different account (should be empty)
|
||||
other_account_children = get_child_ids(_VALID_SALESFORCE_IDS[1])
|
||||
other_account_children = get_child_ids(directory, _VALID_SALESFORCE_IDS[1])
|
||||
assert (
|
||||
len(other_account_children) == 0
|
||||
), "Expected no children for different account"
|
||||
@@ -535,7 +538,7 @@ def _test_relationships() -> None:
|
||||
print("All relationship tests passed successfully!")
|
||||
|
||||
|
||||
def _test_account_with_children() -> None:
|
||||
def _test_account_with_children(directory: str) -> None:
|
||||
"""
|
||||
Tests querying all accounts and retrieving their child objects.
|
||||
This test verifies that:
|
||||
@@ -544,16 +547,16 @@ def _test_account_with_children() -> None:
|
||||
3. Child object data is complete and accurate
|
||||
"""
|
||||
# First get all account IDs
|
||||
account_ids = find_ids_by_type("Account")
|
||||
account_ids = find_ids_by_type(directory, "Account")
|
||||
assert len(account_ids) > 0, "No accounts found"
|
||||
|
||||
# For each account, get its children and verify the data
|
||||
for account_id in account_ids:
|
||||
account = get_record(account_id)
|
||||
account = get_record(directory, account_id)
|
||||
assert account is not None, f"Could not find account {account_id}"
|
||||
|
||||
# Get all child objects
|
||||
child_ids = get_child_ids(account_id)
|
||||
child_ids = get_child_ids(directory, account_id)
|
||||
|
||||
# For Acme Inc., verify specific relationships
|
||||
if account_id == _VALID_SALESFORCE_IDS[0]: # Acme Inc.
|
||||
@@ -564,7 +567,7 @@ def _test_account_with_children() -> None:
|
||||
# Get all child records
|
||||
child_records = []
|
||||
for child_id in child_ids:
|
||||
child_record = get_record(child_id)
|
||||
child_record = get_record(directory, child_id)
|
||||
if child_record is not None:
|
||||
child_records.append(child_record)
|
||||
# Verify Cases
|
||||
@@ -599,7 +602,7 @@ def _test_account_with_children() -> None:
|
||||
print("All account with children tests passed successfully!")
|
||||
|
||||
|
||||
def _test_relationship_updates() -> None:
|
||||
def _test_relationship_updates(directory: str) -> None:
|
||||
"""
|
||||
Tests that relationships are properly updated when a child object's parent reference changes.
|
||||
This test verifies:
|
||||
@@ -616,10 +619,10 @@ def _test_relationship_updates() -> None:
|
||||
"LastName": "Contact",
|
||||
}
|
||||
]
|
||||
_create_csv_file("Contact", initial_contact, "initial_contact.csv")
|
||||
_create_csv_file(directory, "Contact", initial_contact, "initial_contact.csv")
|
||||
|
||||
# Verify initial relationship
|
||||
acme_children = get_child_ids(_VALID_SALESFORCE_IDS[0])
|
||||
acme_children = get_child_ids(directory, _VALID_SALESFORCE_IDS[0])
|
||||
assert (
|
||||
_VALID_SALESFORCE_IDS[40] in acme_children
|
||||
), "Initial relationship not created"
|
||||
@@ -633,22 +636,22 @@ def _test_relationship_updates() -> None:
|
||||
"LastName": "Contact",
|
||||
}
|
||||
]
|
||||
_create_csv_file("Contact", updated_contact, "updated_contact.csv")
|
||||
_create_csv_file(directory, "Contact", updated_contact, "updated_contact.csv")
|
||||
|
||||
# Verify old relationship is removed
|
||||
acme_children = get_child_ids(_VALID_SALESFORCE_IDS[0])
|
||||
acme_children = get_child_ids(directory, _VALID_SALESFORCE_IDS[0])
|
||||
assert (
|
||||
_VALID_SALESFORCE_IDS[40] not in acme_children
|
||||
), "Old relationship not removed"
|
||||
|
||||
# Verify new relationship is created
|
||||
globex_children = get_child_ids(_VALID_SALESFORCE_IDS[1])
|
||||
globex_children = get_child_ids(directory, _VALID_SALESFORCE_IDS[1])
|
||||
assert _VALID_SALESFORCE_IDS[40] in globex_children, "New relationship not created"
|
||||
|
||||
print("All relationship update tests passed successfully!")
|
||||
|
||||
|
||||
def _test_get_affected_parent_ids() -> None:
|
||||
def _test_get_affected_parent_ids(directory: str) -> None:
|
||||
"""
|
||||
Tests get_affected_parent_ids functionality by verifying:
|
||||
1. IDs that are directly in the parent_types list are included
|
||||
@@ -683,13 +686,13 @@ def _test_get_affected_parent_ids() -> None:
|
||||
|
||||
# Create and update CSV files for test data
|
||||
for object_type, records in test_data.items():
|
||||
_create_csv_file(object_type, records)
|
||||
_create_csv_file(directory, object_type, records)
|
||||
|
||||
# Test Case 1: Account directly in updated_ids and parent_types
|
||||
updated_ids = [_VALID_SALESFORCE_IDS[1]] # Parent Account 2
|
||||
parent_types = ["Account"]
|
||||
affected_ids_by_type = dict(
|
||||
get_affected_parent_ids_by_type(updated_ids, parent_types)
|
||||
get_affected_parent_ids_by_type(directory, updated_ids, parent_types)
|
||||
)
|
||||
assert "Account" in affected_ids_by_type, "Account type not in affected_ids_by_type"
|
||||
assert (
|
||||
@@ -700,7 +703,7 @@ def _test_get_affected_parent_ids() -> None:
|
||||
updated_ids = [_VALID_SALESFORCE_IDS[40]] # Child Contact
|
||||
parent_types = ["Account"]
|
||||
affected_ids_by_type = dict(
|
||||
get_affected_parent_ids_by_type(updated_ids, parent_types)
|
||||
get_affected_parent_ids_by_type(directory, updated_ids, parent_types)
|
||||
)
|
||||
assert "Account" in affected_ids_by_type, "Account type not in affected_ids_by_type"
|
||||
assert (
|
||||
@@ -711,7 +714,7 @@ def _test_get_affected_parent_ids() -> None:
|
||||
updated_ids = [_VALID_SALESFORCE_IDS[1], _VALID_SALESFORCE_IDS[40]] # Both cases
|
||||
parent_types = ["Account"]
|
||||
affected_ids_by_type = dict(
|
||||
get_affected_parent_ids_by_type(updated_ids, parent_types)
|
||||
get_affected_parent_ids_by_type(directory, updated_ids, parent_types)
|
||||
)
|
||||
assert "Account" in affected_ids_by_type, "Account type not in affected_ids_by_type"
|
||||
affected_ids = affected_ids_by_type["Account"]
|
||||
@@ -726,7 +729,7 @@ def _test_get_affected_parent_ids() -> None:
|
||||
updated_ids = [_VALID_SALESFORCE_IDS[40]] # Child Contact
|
||||
parent_types = ["Opportunity"] # Wrong type
|
||||
affected_ids_by_type = dict(
|
||||
get_affected_parent_ids_by_type(updated_ids, parent_types)
|
||||
get_affected_parent_ids_by_type(directory, updated_ids, parent_types)
|
||||
)
|
||||
assert len(affected_ids_by_type) == 0, "Should return empty dict when no matches"
|
||||
|
||||
@@ -734,13 +737,15 @@ def _test_get_affected_parent_ids() -> None:
|
||||
|
||||
|
||||
def test_salesforce_sqlite() -> None:
|
||||
_clear_sf_db()
|
||||
init_db()
|
||||
_create_csv_with_example_data()
|
||||
_test_query()
|
||||
_test_upsert()
|
||||
_test_relationships()
|
||||
_test_account_with_children()
|
||||
_test_relationship_updates()
|
||||
_test_get_affected_parent_ids()
|
||||
_clear_sf_db()
|
||||
directory = BASE_DATA_PATH
|
||||
|
||||
_clear_sf_db(directory)
|
||||
init_db(directory)
|
||||
_create_csv_with_example_data(directory)
|
||||
_test_query(directory)
|
||||
_test_upsert(directory)
|
||||
_test_relationships(directory)
|
||||
_test_account_with_children(directory)
|
||||
_test_relationship_updates(directory)
|
||||
_test_get_affected_parent_ids(directory)
|
||||
_clear_sf_db(directory)
|
||||
|
||||
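Note on the Salesforce SQLite test hunks above: they thread an explicit directory argument through _clear_sf_db, init_db, _create_csv_file, and the query/upsert/relationship helpers instead of letting them read the module-level BASE_DATA_PATH. A hedged sketch of what that parameterization allows follows; pointing the helpers at a temporary directory is an assumption, not something this diff does (test_salesforce_sqlite still passes BASE_DATA_PATH).

import tempfile

# These helpers are the ones defined in the test module above; only the
# temporary-directory usage is assumed here.
with tempfile.TemporaryDirectory() as directory:
    _clear_sf_db(directory)
    init_db(directory)
    _create_csv_with_example_data(directory)
    _test_query(directory)
    _clear_sf_db(directory)
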
backend/tests/unit/onyx/indexing/test_censoring.py (new file, 208 lines)
@@ -0,0 +1,208 @@
|
||||
import os
|
||||
from unittest.mock import MagicMock
|
||||
from unittest.mock import patch
|
||||
|
||||
import pytest
|
||||
|
||||
from onyx.configs.constants import DocumentSource
|
||||
from onyx.context.search.models import InferenceChunk
|
||||
from onyx.db.models import User
|
||||
from onyx.utils.variable_functionality import fetch_ee_implementation_or_noop
|
||||
|
||||
_post_query_chunk_censoring = fetch_ee_implementation_or_noop(
|
||||
"onyx.external_permissions.post_query_censoring", "_post_query_chunk_censoring"
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.skipif(
|
||||
os.environ.get("ENABLE_PAID_ENTERPRISE_EDITION_FEATURES", "").lower() != "true",
|
||||
reason="Permissions tests are enterprise only",
|
||||
)
|
||||
class TestPostQueryChunkCensoring:
|
||||
@pytest.fixture(autouse=True)
|
||||
def setUp(self) -> None:
|
||||
self.mock_user = User(id=1, email="test@example.com")
|
||||
self.mock_chunk_1 = InferenceChunk(
|
||||
document_id="doc1",
|
||||
chunk_id=1,
|
||||
content="chunk1 content",
|
||||
source_type=DocumentSource.SALESFORCE,
|
||||
semantic_identifier="doc1_1",
|
||||
title="doc1",
|
||||
boost=1,
|
||||
recency_bias=1.0,
|
||||
score=0.9,
|
||||
hidden=False,
|
||||
metadata={},
|
||||
match_highlights=[],
|
||||
doc_summary="doc1 summary",
|
||||
chunk_context="doc1 context",
|
||||
updated_at=None,
|
||||
image_file_name=None,
|
||||
source_links={},
|
||||
section_continuation=False,
|
||||
blurb="chunk1",
|
||||
)
|
||||
self.mock_chunk_2 = InferenceChunk(
|
||||
document_id="doc2",
|
||||
chunk_id=2,
|
||||
content="chunk2 content",
|
||||
source_type=DocumentSource.SLACK,
|
||||
semantic_identifier="doc2_2",
|
||||
title="doc2",
|
||||
boost=1,
|
||||
recency_bias=1.0,
|
||||
score=0.8,
|
||||
hidden=False,
|
||||
metadata={},
|
||||
match_highlights=[],
|
||||
doc_summary="doc2 summary",
|
||||
chunk_context="doc2 context",
|
||||
updated_at=None,
|
||||
image_file_name=None,
|
||||
source_links={},
|
||||
section_continuation=False,
|
||||
blurb="chunk2",
|
||||
)
|
||||
self.mock_chunk_3 = InferenceChunk(
|
||||
document_id="doc3",
|
||||
chunk_id=3,
|
||||
content="chunk3 content",
|
||||
source_type=DocumentSource.SALESFORCE,
|
||||
semantic_identifier="doc3_3",
|
||||
title="doc3",
|
||||
boost=1,
|
||||
recency_bias=1.0,
|
||||
score=0.7,
|
||||
hidden=False,
|
||||
metadata={},
|
||||
match_highlights=[],
|
||||
doc_summary="doc3 summary",
|
||||
chunk_context="doc3 context",
|
||||
updated_at=None,
|
||||
image_file_name=None,
|
||||
source_links={},
|
||||
section_continuation=False,
|
||||
blurb="chunk3",
|
||||
)
|
||||
self.mock_chunk_4 = InferenceChunk(
|
||||
document_id="doc4",
|
||||
chunk_id=4,
|
||||
content="chunk4 content",
|
||||
source_type=DocumentSource.SALESFORCE,
|
||||
semantic_identifier="doc4_4",
|
||||
title="doc4",
|
||||
boost=1,
|
||||
recency_bias=1.0,
|
||||
score=0.6,
|
||||
hidden=False,
|
||||
metadata={},
|
||||
match_highlights=[],
|
||||
doc_summary="doc4 summary",
|
||||
chunk_context="doc4 context",
|
||||
updated_at=None,
|
||||
image_file_name=None,
|
||||
source_links={},
|
||||
section_continuation=False,
|
||||
blurb="chunk4",
|
||||
)
|
||||
|
||||
@patch(
|
||||
"ee.onyx.external_permissions.post_query_censoring._get_all_censoring_enabled_sources"
|
||||
)
|
||||
def test_post_query_chunk_censoring_no_user(
|
||||
self, mock_get_sources: MagicMock
|
||||
) -> None:
|
||||
mock_get_sources.return_value = {DocumentSource.SALESFORCE}
|
||||
chunks = [self.mock_chunk_1, self.mock_chunk_2]
|
||||
result = _post_query_chunk_censoring(chunks, None)
|
||||
assert result == chunks
|
||||
|
||||
@patch(
|
||||
"ee.onyx.external_permissions.post_query_censoring._get_all_censoring_enabled_sources"
|
||||
)
|
||||
@patch(
|
||||
"ee.onyx.external_permissions.post_query_censoring.DOC_SOURCE_TO_CHUNK_CENSORING_FUNCTION"
|
||||
)
|
||||
def test_post_query_chunk_censoring_salesforce_censored(
|
||||
self, mock_censor_func: MagicMock, mock_get_sources: MagicMock
|
||||
) -> None:
|
||||
mock_get_sources.return_value = {DocumentSource.SALESFORCE}
|
||||
mock_censor_func_impl = MagicMock(
|
||||
return_value=[self.mock_chunk_1]
|
||||
) # Only return chunk 1
|
||||
mock_censor_func.__getitem__.return_value = mock_censor_func_impl
|
||||
|
||||
chunks = [self.mock_chunk_1, self.mock_chunk_2, self.mock_chunk_3]
|
||||
result = _post_query_chunk_censoring(chunks, self.mock_user)
|
||||
assert len(result) == 2
|
||||
assert self.mock_chunk_1 in result
|
||||
assert self.mock_chunk_2 in result
|
||||
assert self.mock_chunk_3 not in result
|
||||
mock_censor_func_impl.assert_called_once()
|
||||
|
||||
@patch(
|
||||
"ee.onyx.external_permissions.post_query_censoring._get_all_censoring_enabled_sources"
|
||||
)
|
||||
@patch(
|
||||
"ee.onyx.external_permissions.post_query_censoring.DOC_SOURCE_TO_CHUNK_CENSORING_FUNCTION"
|
||||
)
|
||||
def test_post_query_chunk_censoring_salesforce_error(
|
||||
self, mock_censor_func: MagicMock, mock_get_sources: MagicMock
|
||||
) -> None:
|
||||
mock_get_sources.return_value = {DocumentSource.SALESFORCE}
|
||||
mock_censor_func_impl = MagicMock(side_effect=Exception("Censoring error"))
|
||||
mock_censor_func.__getitem__.return_value = mock_censor_func_impl
|
||||
|
||||
chunks = [self.mock_chunk_1, self.mock_chunk_2, self.mock_chunk_3]
|
||||
result = _post_query_chunk_censoring(chunks, self.mock_user)
|
||||
assert len(result) == 1
|
||||
assert self.mock_chunk_2 in result
|
||||
mock_censor_func_impl.assert_called_once()
|
||||
|
||||
@patch(
|
||||
"ee.onyx.external_permissions.post_query_censoring._get_all_censoring_enabled_sources"
|
||||
)
|
||||
@patch(
|
||||
"ee.onyx.external_permissions.post_query_censoring.DOC_SOURCE_TO_CHUNK_CENSORING_FUNCTION"
|
||||
)
|
||||
def test_post_query_chunk_censoring_no_censoring(
|
||||
self, mock_censor_func: MagicMock, mock_get_sources: MagicMock
|
||||
) -> None:
|
||||
mock_get_sources.return_value = set() # No sources to censor
|
||||
mock_censor_func_impl = MagicMock()
|
||||
mock_censor_func.__getitem__.return_value = mock_censor_func_impl
|
||||
|
||||
chunks = [self.mock_chunk_1, self.mock_chunk_2, self.mock_chunk_3]
|
||||
result = _post_query_chunk_censoring(chunks, self.mock_user)
|
||||
assert result == chunks
|
||||
mock_censor_func_impl.assert_not_called()
|
||||
|
||||
@patch(
|
||||
"ee.onyx.external_permissions.post_query_censoring._get_all_censoring_enabled_sources"
|
||||
)
|
||||
@patch(
|
||||
"ee.onyx.external_permissions.post_query_censoring.DOC_SOURCE_TO_CHUNK_CENSORING_FUNCTION"
|
||||
)
|
||||
def test_post_query_chunk_censoring_order_maintained(
|
||||
self, mock_censor_func: MagicMock, mock_get_sources: MagicMock
|
||||
) -> None:
|
||||
mock_get_sources.return_value = {DocumentSource.SALESFORCE}
|
||||
mock_censor_func_impl = MagicMock(
|
||||
return_value=[self.mock_chunk_3, self.mock_chunk_1]
|
||||
) # Return chunk 3 and 1
|
||||
mock_censor_func.__getitem__.return_value = mock_censor_func_impl
|
||||
|
||||
chunks = [
|
||||
self.mock_chunk_1,
|
||||
self.mock_chunk_2,
|
||||
self.mock_chunk_3,
|
||||
self.mock_chunk_4,
|
||||
]
|
||||
result = _post_query_chunk_censoring(chunks, self.mock_user)
|
||||
assert len(result) == 3
|
||||
assert result[0] == self.mock_chunk_1
|
||||
assert result[1] == self.mock_chunk_2
|
||||
assert result[2] == self.mock_chunk_3
|
||||
assert self.mock_chunk_4 not in result
|
||||
mock_censor_func_impl.assert_called_once()
|
||||
backend/tests/unit/onyx/utils/test_vespa_query.py (new file, 270 lines)
@@ -0,0 +1,270 @@
|
||||
from datetime import datetime
|
||||
from datetime import timedelta
|
||||
from datetime import timezone
|
||||
|
||||
from onyx.configs.constants import DocumentSource
|
||||
from onyx.configs.constants import INDEX_SEPARATOR
|
||||
from onyx.context.search.models import IndexFilters
|
||||
from onyx.context.search.models import Tag
|
||||
from onyx.document_index.vespa.shared_utils.vespa_request_builders import (
|
||||
build_vespa_filters,
|
||||
)
|
||||
from onyx.document_index.vespa_constants import DOC_UPDATED_AT
|
||||
from onyx.document_index.vespa_constants import DOCUMENT_SETS
|
||||
from onyx.document_index.vespa_constants import HIDDEN
|
||||
from onyx.document_index.vespa_constants import METADATA_LIST
|
||||
from onyx.document_index.vespa_constants import SOURCE_TYPE
|
||||
from onyx.document_index.vespa_constants import TENANT_ID
|
||||
from onyx.document_index.vespa_constants import USER_FILE
|
||||
from onyx.document_index.vespa_constants import USER_FOLDER
|
||||
from shared_configs.configs import MULTI_TENANT
|
||||
|
||||
# Import the function under test
|
||||
|
||||
|
||||
class TestBuildVespaFilters:
|
||||
def test_empty_filters(self) -> None:
|
||||
"""Test with empty filters object."""
|
||||
filters = IndexFilters(access_control_list=[])
|
||||
result = build_vespa_filters(filters)
|
||||
assert result == f"!({HIDDEN}=true) and "
|
||||
|
||||
# With trailing AND removed
|
||||
result = build_vespa_filters(filters, remove_trailing_and=True)
|
||||
assert result == f"!({HIDDEN}=true)"
|
||||
|
||||
def test_include_hidden(self) -> None:
|
||||
"""Test with include_hidden flag."""
|
||||
filters = IndexFilters(access_control_list=[])
|
||||
result = build_vespa_filters(filters, include_hidden=True)
|
||||
assert result == "" # No filters applied when including hidden
|
||||
|
||||
# With some other filter to ensure proper AND chaining
|
||||
filters = IndexFilters(access_control_list=[], source_type=[DocumentSource.WEB])
|
||||
result = build_vespa_filters(filters, include_hidden=True)
|
||||
assert result == f'({SOURCE_TYPE} contains "web") and '
|
||||
|
||||
def test_acl(self) -> None:
|
||||
"""Test with acls."""
|
||||
# Single ACL
|
||||
filters = IndexFilters(access_control_list=["user1"])
|
||||
result = build_vespa_filters(filters)
|
||||
assert (
|
||||
result
|
||||
== f'!({HIDDEN}=true) and (access_control_list contains "user1") and '
|
||||
)
|
||||
|
||||
# Multiple ACL's
|
||||
filters = IndexFilters(access_control_list=["user2", "group2"])
|
||||
result = build_vespa_filters(filters)
|
||||
assert (
|
||||
result
|
||||
== f'!({HIDDEN}=true) and (access_control_list contains "user2" or access_control_list contains "group2") and '
|
||||
)
|
||||
|
||||
def test_tenant_filter(self) -> None:
|
||||
"""Test tenant ID filtering."""
|
||||
# With tenant ID
|
||||
if MULTI_TENANT:
|
||||
filters = IndexFilters(access_control_list=[], tenant_id="tenant1")
|
||||
result = build_vespa_filters(filters)
|
||||
assert (
|
||||
f'!({HIDDEN}=true) and ({TENANT_ID} contains "tenant1") and ' == result
|
||||
)
|
||||
|
||||
# No tenant ID
|
||||
filters = IndexFilters(access_control_list=[], tenant_id=None)
|
||||
result = build_vespa_filters(filters)
|
||||
assert f"!({HIDDEN}=true) and " == result
|
||||
|
||||
def test_source_type_filter(self) -> None:
|
||||
"""Test source type filtering."""
|
||||
# Single source type
|
||||
filters = IndexFilters(access_control_list=[], source_type=[DocumentSource.WEB])
|
||||
result = build_vespa_filters(filters)
|
||||
assert f'!({HIDDEN}=true) and ({SOURCE_TYPE} contains "web") and ' == result
|
||||
|
||||
# Multiple source types
|
||||
filters = IndexFilters(
|
||||
access_control_list=[],
|
||||
source_type=[DocumentSource.WEB, DocumentSource.JIRA],
|
||||
)
|
||||
result = build_vespa_filters(filters)
|
||||
assert (
|
||||
f'!({HIDDEN}=true) and ({SOURCE_TYPE} contains "web" or {SOURCE_TYPE} contains "jira") and '
|
||||
== result
|
||||
)
|
||||
|
||||
# Empty source type list
|
||||
filters = IndexFilters(access_control_list=[], source_type=[])
|
||||
result = build_vespa_filters(filters)
|
||||
assert f"!({HIDDEN}=true) and " == result
|
||||
|
||||
def test_tag_filters(self) -> None:
|
||||
"""Test tag filtering."""
|
||||
# Single tag
|
||||
filters = IndexFilters(
|
||||
access_control_list=[], tags=[Tag(tag_key="color", tag_value="red")]
|
||||
)
|
||||
result = build_vespa_filters(filters)
|
||||
assert (
|
||||
f'!({HIDDEN}=true) and ({METADATA_LIST} contains "color{INDEX_SEPARATOR}red") and '
|
||||
== result
|
||||
)
|
||||
|
||||
# Multiple tags
|
||||
filters = IndexFilters(
|
||||
access_control_list=[],
|
||||
tags=[
|
||||
Tag(tag_key="color", tag_value="red"),
|
||||
Tag(tag_key="size", tag_value="large"),
|
||||
],
|
||||
)
|
||||
result = build_vespa_filters(filters)
|
||||
expected = (
|
||||
f'!({HIDDEN}=true) and ({METADATA_LIST} contains "color{INDEX_SEPARATOR}red" '
|
||||
f'or {METADATA_LIST} contains "size{INDEX_SEPARATOR}large") and '
|
||||
)
|
||||
assert expected == result
|
||||
|
||||
# Empty tags list
|
||||
filters = IndexFilters(access_control_list=[], tags=[])
|
||||
result = build_vespa_filters(filters)
|
||||
assert f"!({HIDDEN}=true) and " == result
|
||||
|
||||
def test_document_sets_filter(self) -> None:
|
||||
"""Test document sets filtering."""
|
||||
# Single document set
|
||||
filters = IndexFilters(access_control_list=[], document_set=["set1"])
|
||||
result = build_vespa_filters(filters)
|
||||
assert f'!({HIDDEN}=true) and ({DOCUMENT_SETS} contains "set1") and ' == result
|
||||
|
||||
# Multiple document sets
|
||||
filters = IndexFilters(access_control_list=[], document_set=["set1", "set2"])
|
||||
result = build_vespa_filters(filters)
|
||||
assert (
|
||||
f'!({HIDDEN}=true) and ({DOCUMENT_SETS} contains "set1" or {DOCUMENT_SETS} contains "set2") and '
|
||||
== result
|
||||
)
|
||||
|
||||
# Empty document sets
|
||||
filters = IndexFilters(access_control_list=[], document_set=[])
|
||||
result = build_vespa_filters(filters)
|
||||
assert f"!({HIDDEN}=true) and " == result
|
||||
|
||||
def test_user_file_ids_filter(self) -> None:
|
||||
"""Test user file IDs filtering."""
|
||||
# Single user file ID
|
||||
filters = IndexFilters(access_control_list=[], user_file_ids=[123])
|
||||
result = build_vespa_filters(filters)
|
||||
assert f"!({HIDDEN}=true) and ({USER_FILE} = 123) and " == result
|
||||
|
||||
# Multiple user file IDs
|
||||
filters = IndexFilters(access_control_list=[], user_file_ids=[123, 456])
|
||||
result = build_vespa_filters(filters)
|
||||
assert (
|
||||
f"!({HIDDEN}=true) and ({USER_FILE} = 123 or {USER_FILE} = 456) and "
|
||||
== result
|
||||
)
|
||||
|
||||
# Empty user file IDs
|
||||
filters = IndexFilters(access_control_list=[], user_file_ids=[])
|
||||
result = build_vespa_filters(filters)
|
||||
assert f"!({HIDDEN}=true) and " == result
|
||||
|
||||
def test_user_folder_ids_filter(self) -> None:
|
||||
"""Test user folder IDs filtering."""
|
||||
# Single user folder ID
|
||||
filters = IndexFilters(access_control_list=[], user_folder_ids=[789])
|
||||
result = build_vespa_filters(filters)
|
||||
assert f"!({HIDDEN}=true) and ({USER_FOLDER} = 789) and " == result
|
||||
|
||||
# Multiple user folder IDs
|
||||
filters = IndexFilters(access_control_list=[], user_folder_ids=[789, 101])
|
||||
result = build_vespa_filters(filters)
|
||||
assert (
|
||||
f"!({HIDDEN}=true) and ({USER_FOLDER} = 789 or {USER_FOLDER} = 101) and "
|
||||
== result
|
||||
)
|
||||
|
||||
# Empty user folder IDs
|
||||
filters = IndexFilters(access_control_list=[], user_folder_ids=[])
|
||||
result = build_vespa_filters(filters)
|
||||
assert f"!({HIDDEN}=true) and " == result
|
||||
|
||||
def test_time_cutoff_filter(self) -> None:
|
||||
"""Test time cutoff filtering."""
|
||||
# With cutoff time
|
||||
cutoff_time = datetime(2023, 1, 1, tzinfo=timezone.utc)
|
||||
filters = IndexFilters(access_control_list=[], time_cutoff=cutoff_time)
|
||||
result = build_vespa_filters(filters)
|
||||
cutoff_secs = int(cutoff_time.timestamp())
|
||||
assert (
|
||||
f"!({HIDDEN}=true) and !({DOC_UPDATED_AT} < {cutoff_secs}) and " == result
|
||||
)
|
||||
|
||||
# No cutoff time
|
||||
filters = IndexFilters(access_control_list=[], time_cutoff=None)
|
||||
result = build_vespa_filters(filters)
|
||||
assert f"!({HIDDEN}=true) and " == result
|
||||
|
||||
# Test untimed logic (when cutoff is old enough)
|
||||
old_cutoff = datetime.now(timezone.utc) - timedelta(days=100)
|
||||
filters = IndexFilters(access_control_list=[], time_cutoff=old_cutoff)
|
||||
result = build_vespa_filters(filters)
|
||||
old_cutoff_secs = int(old_cutoff.timestamp())
|
||||
assert (
|
||||
f"!({HIDDEN}=true) and !({DOC_UPDATED_AT} < {old_cutoff_secs}) and "
|
||||
== result
|
||||
)
|
||||
|
||||
def test_combined_filters(self) -> None:
|
||||
"""Test combining multiple filter types."""
|
||||
filters = IndexFilters(
|
||||
access_control_list=["user1", "group1"],
|
||||
source_type=[DocumentSource.WEB],
|
||||
tags=[Tag(tag_key="color", tag_value="red")],
|
||||
document_set=["set1"],
|
||||
user_file_ids=[123],
|
||||
user_folder_ids=[789],
|
||||
time_cutoff=datetime(2023, 1, 1, tzinfo=timezone.utc),
|
||||
)
|
||||
|
||||
result = build_vespa_filters(filters)
|
||||
|
||||
# Build expected result piece by piece for readability
|
||||
expected = f"!({HIDDEN}=true) and "
|
||||
expected += (
|
||||
'(access_control_list contains "user1" or '
|
||||
'access_control_list contains "group1") and '
|
||||
)
|
||||
expected += f'({SOURCE_TYPE} contains "web") and '
|
||||
expected += f'({METADATA_LIST} contains "color{INDEX_SEPARATOR}red") and '
|
||||
expected += f'({DOCUMENT_SETS} contains "set1") and '
|
||||
expected += f"({USER_FILE} = 123) and "
|
||||
expected += f"({USER_FOLDER} = 789) and "
|
||||
cutoff_secs = int(datetime(2023, 1, 1, tzinfo=timezone.utc).timestamp())
|
||||
expected += f"!({DOC_UPDATED_AT} < {cutoff_secs}) and "
|
||||
|
||||
assert expected == result
|
||||
|
||||
# With trailing AND removed
|
||||
result_no_trailing = build_vespa_filters(filters, remove_trailing_and=True)
|
||||
assert expected[:-5] == result_no_trailing # Remove trailing " and "
|
||||
|
||||
def test_empty_or_none_values(self) -> None:
|
||||
"""Test with empty or None values in filter lists."""
|
||||
# Empty strings in document set
|
||||
filters = IndexFilters(
|
||||
access_control_list=[], document_set=["set1", "", "set2"]
|
||||
)
|
||||
result = build_vespa_filters(filters)
|
||||
assert (
|
||||
f'!({HIDDEN}=true) and ({DOCUMENT_SETS} contains "set1" or {DOCUMENT_SETS} contains "set2") and '
|
||||
== result
|
||||
)
|
||||
|
||||
# All empty strings in document set
|
||||
filters = IndexFilters(access_control_list=[], document_set=["", ""])
|
||||
result = build_vespa_filters(filters)
|
||||
assert f"!({HIDDEN}=true) and " == result
|
||||
company_links.csv (new file, 529 lines)
@@ -0,0 +1,529 @@
|
||||
Company,Link
|
||||
1849-bio,https://x.com/1849bio
|
||||
1stcollab,https://twitter.com/ycombinator
|
||||
abundant,https://x.com/abundant_labs
|
||||
activepieces,https://mobile.twitter.com/mabuaboud
|
||||
acx,https://twitter.com/ycombinator
|
||||
adri-ai,https://twitter.com/darshitac_
|
||||
affil-ai,https://twitter.com/ycombinator
|
||||
agave,https://twitter.com/moyicat
|
||||
aglide,https://twitter.com/pdmcguckian
|
||||
ai-2,https://twitter.com/the_yuppy
|
||||
ai-sell,https://x.com/liuzjerry
|
||||
airtrain-ai,https://twitter.com/neutralino1
|
||||
aisdr,https://twitter.com/YuriyZaremba
|
||||
alex,https://x.com/DanielEdrisian
|
||||
alga-biosciences,https://twitter.com/algabiosciences
|
||||
alguna,https://twitter.com/aleks_djekic
|
||||
alixia,https://twitter.com/ycombinator
|
||||
aminoanalytica,https://x.com/lilwuuzivert
|
||||
anara,https://twitter.com/naveedjanmo
|
||||
andi,https://twitter.com/MiamiAngela
|
||||
andoria,https://x.com/dbudimane
|
||||
andromeda-surgical,https://twitter.com/nickdamian0
|
||||
anglera,https://twitter.com/ycombinator
|
||||
angstrom-ai,https://twitter.com/JaviAC7
|
||||
ankr-health,https://twitter.com/Ankr_us
|
||||
apoxy,https://twitter.com/ycombinator
|
||||
apten,https://twitter.com/dho1357
|
||||
aragorn-ai,https://twitter.com/ycombinator
|
||||
arc-2,https://twitter.com/DarkMirage
|
||||
archilabs,https://twitter.com/ycombinator
|
||||
arcimus,https://twitter.com/husseinsyed73
|
||||
argovox,https://www.argovox.com/
|
||||
artemis-search,https://twitter.com/ycombinator
|
||||
artie,https://x.com/JacquelineSYC19
|
||||
asklio,https://twitter.com/butterflock
|
||||
atlas-2,https://twitter.com/jobryan
|
||||
attain,https://twitter.com/aamir_hudda
|
||||
autocomputer,https://twitter.com/madhavsinghal_
|
||||
automat,https://twitter.com/lucas0choa
|
||||
automorphic,https://twitter.com/sandkoan
|
||||
autopallet-robotics,https://twitter.com/ycombinator
|
||||
autumn-labs,https://twitter.com/ycombinator
|
||||
aviary,https://twitter.com/ycombinator
|
||||
azuki,https://twitter.com/VamptVo
|
||||
banabo,https://twitter.com/ycombinator
|
||||
baseline-ai,https://twitter.com/ycombinator
|
||||
baserun,https://twitter.com/effyyzhang
|
||||
benchify,https://www.x.com/maxvonhippel
|
||||
berry,https://twitter.com/annchanyt
|
||||
bifrost,https://twitter.com/0xMysterious
|
||||
bifrost-orbital,https://x.com/ionkarbatra
|
||||
biggerpicture,https://twitter.com/ycombinator
|
||||
biocartesian,https://twitter.com/ycombinator
|
||||
bland-ai,https://twitter.com/zaygranet
|
||||
blast,https://x.com/useblast
|
||||
blaze,https://twitter.com/larfy_rothwell
|
||||
bluebirds,https://twitter.com/RohanPunamia
|
||||
bluedot,https://twitter.com/selinayfilizp
|
||||
bluehill-payments,https://twitter.com/HimanshuMinocha
|
||||
blyss,https://twitter.com/blyssdev
|
||||
bolto,https://twitter.com/mrinalsingh02?lang=en
|
||||
botcity,https://twitter.com/lorhancaproni
|
||||
boundo,https://twitter.com/ycombinator
|
||||
bramble,https://x.com/meksikanpijha
|
||||
bricksai,https://twitter.com/ycombinator
|
||||
broccoli-ai,https://twitter.com/abhishekjain25
|
||||
bronco-ai,https://twitter.com/dluozhang
|
||||
bunting-labs,https://twitter.com/normconstant
|
||||
byterat,https://twitter.com/penelopekjones_
|
||||
callback,https://twitter.com/ycombinator
|
||||
cambio-2,https://twitter.com/ycombinator
|
||||
camfer,https://x.com/AryaBastani
|
||||
campfire-2,https://twitter.com/ycombinator
|
||||
campfire-applied-ai-company,https://twitter.com/siamakfr
|
||||
candid,https://x.com/kesavkosana
|
||||
canvas,https://x.com/essamsleiman
|
||||
capsule,https://twitter.com/kelsey_pedersen
|
||||
cardinal,http://twitter.com/nadavwiz
|
||||
cardinal-gray,https://twitter.com/ycombinator
|
||||
cargo,https://twitter.com/aureeaubert
|
||||
cartage,https://twitter.com/ycombinator
|
||||
cashmere,https://twitter.com/shashankbuilds
|
||||
cedalio,https://twitter.com/LucianaReznik
|
||||
cekura-2,https://x.com/tarush_agarwal_
|
||||
central,https://twitter.com/nilaymod
|
||||
champ,https://twitter.com/ycombinator
|
||||
cheers,https://twitter.com/ycombinator
|
||||
chequpi,https://twitter.com/sudshekhar02
|
||||
chima,https://twitter.com/nikharanirghin
|
||||
cinapse,https://www.twitter.com/hgphillipsiv
|
||||
ciro,https://twitter.com/davidjwiner
|
||||
clara,https://x.com/levinsonjon
|
||||
cleancard,https://twitter.com/_tom_dot_com
|
||||
clearspace,https://twitter.com/rbfasho
|
||||
cobbery,https://twitter.com/Dan_The_Goodman
|
||||
codeviz,https://x.com/liam_prev
|
||||
coil-inc,https://twitter.com/ycombinator
|
||||
coldreach,https://twitter.com/ycombinator
|
||||
combinehealth,https://twitter.com/ycombinator
|
||||
comfy-deploy,https://twitter.com/nicholaskkao
|
||||
complete,https://twitter.com/ranimavram
|
||||
conductor-quantum,https://twitter.com/BrandonSeverin
|
||||
conduit,https://twitter.com/ycombinator
|
||||
continue,https://twitter.com/tylerjdunn
|
||||
contour,https://twitter.com/ycombinator
|
||||
coperniq,https://twitter.com/abdullahzandani
|
||||
corgea,https://twitter.com/asadeddin
|
||||
corgi,https://twitter.com/nico_laqua?lang=en
|
||||
corgi-labs,https://twitter.com/ycombinator
|
||||
coris,https://twitter.com/psvinodh
|
||||
cosine,https://twitter.com/AlistairPullen
|
||||
courtyard-io,https://twitter.com/lejeunedall
|
||||
coverage-cat,https://twitter.com/coveragecats
|
||||
craftos,https://twitter.com/wa3l
|
||||
craniometrix,https://craniometrix.com
|
||||
ctgt,https://twitter.com/cyrilgorlla
|
||||
curo,https://x.com/EnergizedAndrew
|
||||
dagworks-inc,https://twitter.com/dagworks
|
||||
dart,https://twitter.com/milad3malek
|
||||
dashdive,https://twitter.com/micahawheat
|
||||
dataleap,https://twitter.com/jh_damm
|
||||
decisional-ai,https://x.com/groovetandon
|
||||
decoda-health,https://twitter.com/ycombinator
|
||||
deepsilicon,https://x.com/abhireddy2004
|
||||
delfino-ai,https://twitter.com/ycombinator
|
||||
demo-gorilla,https://twitter.com/ycombinator
|
||||
demospace,https://www.twitter.com/nick_fiacco
|
||||
dench-com,https://www.twitter.com/markrachapoom
|
||||
denormalized,https://twitter.com/IAmMattGreen
|
||||
dev-tools-ai,https://twitter.com/ycombinator
|
||||
diffusion-studio,https://x.com/MatthiasRuiz22
|
||||
digitalcarbon,https://x.com/CtrlGuruDelete
|
||||
dimely,https://x.com/UseDimely
|
||||
disputeninja,https://twitter.com/legitmaxwu
|
||||
diversion,https://twitter.com/sasham1
|
||||
dmodel,https://twitter.com/dmooooon
|
||||
doctor-droid,https://twitter.com/TheBengaluruGuy
|
||||
dodo,https://x.com/dominik_moehrle
|
||||
dojah-inc,https://twitter.com/ololaday
|
||||
domu-technology-inc,https://twitter.com/ycombinator
|
||||
dr-treat,https://twitter.com/rakeshtondon
|
||||
dreamrp,https://x.com/dreamrpofficial
|
||||
drivingforce,https://twitter.com/drivingforcehq
|
||||
dynamo-ai,https://twitter.com/dynamo_fl
|
||||
edgebit,https://twitter.com/robszumski
|
||||
educato-ai,https://x.com/FelixGabler
|
||||
electric-air-2,https://twitter.com/JezOsborne
|
||||
ember,https://twitter.com/hsinleiwang
|
||||
ember-robotics,https://twitter.com/ycombinator
|
||||
emergent,https://twitter.com/mukundjha
|
||||
emobi,https://twitter.com/ycombinator
|
||||
entangl,https://twitter.com/Shapol_m
|
||||
envelope,https://twitter.com/joshuakcockrell
|
||||
et-al,https://twitter.com/ycombinator
|
||||
eugit-therapeutics,http://www.eugittx.com
|
||||
eventual,https://twitter.com/sammy_sidhu
|
||||
evoly,https://twitter.com/ycombinator
|
||||
expand-ai,https://twitter.com/timsuchanek
|
||||
ezdubs,https://twitter.com/PadmanabhanKri
|
||||
fabius,https://twitter.com/adayNU
|
||||
fazeshift,https://twitter.com/ycombinator
|
||||
felafax,https://twitter.com/ThatNithin
|
||||
fetchr,https://twitter.com/CalvinnChenn
|
||||
fiber-ai,https://twitter.com/AdiAgashe
|
||||
ficra,https://x.com/ficra_ai
|
||||
fiddlecube,https://twitter.com/nupoor_neha
|
||||
finic,https://twitter.com/jfan001
|
||||
finta,https://www.twitter.com/andywang
|
||||
fintool,https://twitter.com/nicbstme
|
||||
finvest,https://twitter.com/shivambharuka
|
||||
firecrawl,https://x.com/ericciarla
|
||||
firstwork,https://twitter.com/techie_Shubham
|
||||
fixa,https://x.com/jonathanzliu
|
||||
flair-health,https://twitter.com/adivawhocodes
|
||||
fleek,https://twitter.com/ycombinator
|
||||
fleetworks,https://twitter.com/ycombinator
|
||||
flike,https://twitter.com/yajmch
|
||||
flint-2,https://twitter.com/hungrysohan
|
||||
floworks,https://twitter.com/sarthaks92
|
||||
focus-buddy,https://twitter.com/yash14700/
|
||||
forerunner-ai,https://x.com/willnida0
|
||||
founders,https://twitter.com/ycombinator
|
||||
foundry,https://x.com/FoundryAI_
|
||||
freestyle,https://x.com/benswerd
|
||||
fresco,https://twitter.com/ycombinator
|
||||
friday,https://x.com/AllenNaliath
|
||||
frigade,https://twitter.com/FrigadeHQ
|
||||
futureclinic,https://twitter.com/usamasyedmd
|
||||
gait,https://twitter.com/AlexYHsia
|
||||
galini,https://twitter.com/ycombinator
|
||||
gauge,https://twitter.com/the1024th
|
||||
gecko-security,https://x.com/jjjutla
|
||||
general-analysis,https://twitter.com/ycombinator
|
||||
giga-ml,https://twitter.com/varunvummadi
|
||||
glade,https://twitter.com/ycombinator
|
||||
glass-health,https://twitter.com/dereckwpaul
|
||||
goodfin,https://twitter.com/ycombinator
|
||||
grai,https://twitter.com/ycombinator
|
||||
greenlite,https://twitter.com/will_lawrenceTO
|
||||
grey,https://www.twitter.com/kingidee
|
||||
happyrobot,https://twitter.com/pablorpalafox
|
||||
haystack-software,https://x.com/AkshaySubr42403
|
||||
health-harbor,https://twitter.com/AlanLiu96
|
||||
healthspark,https://twitter.com/stephengrinich
|
||||
hedgehog-2,https://twitter.com/ycombinator
|
||||
helicone,https://twitter.com/justinstorre
|
||||
heroui,https://x.com/jrgarciadev
|
||||
hoai,https://twitter.com/ycombinator
|
||||
hockeystack,https://twitter.com/ycombinator
|
||||
hokali,https://twitter.com/hokalico
|
||||
homeflow,https://twitter.com/ycombinator
|
||||
hubble-network,https://twitter.com/BenWild10
|
||||
humand,https://twitter.com/nicolasbenenzon
|
||||
humanlayer,https://twitter.com/dexhorthy
|
||||
hydra,https://twitter.com/JoeSciarrino
|
||||
hyperbound,https://twitter.com/sguduguntla
|
||||
ideate-xyz,https://twitter.com/nomocodes
|
||||
inbuild,https://twitter.com/TySharp_iB
|
||||
indexical,https://twitter.com/try_nebula
|
||||
industrial-next,https://twitter.com/ycombinator
|
||||
infisical,https://twitter.com/matsiiako
|
||||
inkeep,https://twitter.com/nickgomezc
|
||||
inlet-2,https://twitter.com/inlet_ai
|
||||
innkeeper,https://twitter.com/tejasybhakta
|
||||
instant,https://twitter.com/JoeAverbukh
|
||||
integrated-reasoning,https://twitter.com/d4r5c2
|
||||
interlock,https://twitter.com/ycombinator
|
||||
intryc,https://x.com/alexmarantelos?lang=en
|
||||
invert,https://twitter.com/purrmin
|
||||
iollo,https://twitter.com/daniel_gomari
|
||||
jamble,https://twitter.com/ycombinator
|
||||
joon-health,https://twitter.com/IsaacVanEaves
|
||||
juicebox,https://twitter.com/davepaffenholz
|
||||
julius,https://twitter.com/0interestrates
|
||||
karmen,https://twitter.com/ycombinator
|
||||
kenley,https://x.com/KenleyAI
|
||||
keylika,https://twitter.com/buddhachaudhuri
|
||||
khoj,https://twitter.com/debanjum
|
||||
kite,https://twitter.com/DerekFeehrer
|
||||
kivo-health,https://twitter.com/vaughnkoch
|
||||
knowtex,https://twitter.com/CarolineCZhang
|
koala,https://twitter.com/studioseinstein?s=11
kopra-bio,https://x.com/AF_Haddad
kura,https://x.com/kura_labs
laminar,https://twitter.com/skull8888888888
lancedb,https://twitter.com/changhiskhan
latent,https://twitter.com/ycombinator
layerup,https://twitter.com/arnavbathla20
lazyeditor,https://twitter.com/jee_cash
ledgerup,https://twitter.com/josephrjohnson
lifelike,https://twitter.com/alecxiang1
lighthouz-ai,https://x.com/srijankedia
lightski,https://www.twitter.com/hansenq
ligo-biosciences,https://x.com/ArdaGoreci/status/1830744265007480934
line-build,https://twitter.com/ycombinator
lingodotdev,https://twitter.com/maxprilutskiy
linkgrep,https://twitter.com/linkgrep
linum,https://twitter.com/schopra909
livedocs,https://twitter.com/arsalanbashir
luca,https://twitter.com/LucaPricingHq
lumenary,https://twitter.com/vivekhaz
lune,https://x.com/samuelp4rk
lynx,https://twitter.com/ycombinator
magic-loops,https://twitter.com/jumploops
manaflow,https://twitter.com/austinywang
mandel-ai,https://twitter.com/shmkkr
martin,https://twitter.com/martinvoiceai
matano,https://twitter.com/AhmedSamrose
mdhub,https://twitter.com/ealamolda
mederva-health,http://twitter.com/sabihmir
medplum,https://twitter.com/ReshmaKhilnani
melty,https://x.com/charliebholtz
mem0,https://twitter.com/taranjeetio
mercator,https://www.twitter.com/ajdstein
mercoa,https://twitter.com/Sarora27
meru,https://twitter.com/rohanarora_
metalware,https://twitter.com/ryanchowww
metriport,https://twitter.com/dimagoncharov_
mica-ai,https://twitter.com/ycombinator
middleware,https://twitter.com/laduramvishnoi
midship,https://twitter.com/_kietay
mintlify,https://twitter.com/hanwangio
minusx,https://twitter.com/nuwandavek
miracle,https://twitter.com/ycombinator
miru-ml,https://twitter.com/armelwtalla
mito-health,https://twitter.com/teemingchew
mocha,https://twitter.com/nichochar
modern-realty,https://x.com/RIsanians
modulari-t,https://twitter.com/ycombinator
mogara,https://twitter.com/ycombinator
monterey-ai,https://twitter.com/chunonline
moonglow,https://twitter.com/leilavclark
moonshine,https://x.com/useMoonshine
moreta,https://twitter.com/ycombinator
mutable-ai,https://x.com/smahsramo
myria,https://twitter.com/reyflemings
nango,https://twitter.com/rguldener
nanograb,https://twitter.com/lauhoyeung
nara,https://twitter.com/join_nara
narrative,https://twitter.com/axitkhurana
nectar,https://twitter.com/AllenWang314
neosync,https://twitter.com/evisdrenova
nerve,https://x.com/fortress_build
networkocean,https://twitter.com/sammendel4
ngrow-ai,https://twitter.com/ycombinator
no-cap,https://x.com/nocapso
nowadays,https://twitter.com/ycombinator
numeral,https://www.twitter.com/mduvall_
obento-health,https://twitter.com/ycombinator
octopipe,https://twitter.com/abhishekray07
odo,https://twitter.com/ycombinator
ofone,https://twitter.com/ycombinator
onetext,http://twitter.com/jfudem
openfunnel,https://x.com/fenilsuchak
opensight,https://twitter.com/OpenSightAI
ora-ai,https://twitter.com/ryan_rl_phelps
orchid,https://twitter.com/ycombinator
origami-agents,https://x.com/fin465
outerbase,https://www.twitter.com/burcs
outerport,https://x.com/yongyuanxi
outset,https://twitter.com/AaronLCannon
overeasy,https://twitter.com/skyflylu
overlap,https://x.com/jbaerofficial
oway,https://twitter.com/owayinc
ozone,https://twitter.com/maxvwolff
pair-ai,https://twitter.com/ycombinator
palmier,https://twitter.com/ycombinator
panora,https://twitter.com/rflih_
parabolic,https://twitter.com/ycombinator
paragon-ai,https://twitter.com/ycombinator
parahelp,https://twitter.com/ankerbachryhl
parity,https://x.com/wilson_spearman
parley,https://twitter.com/ycombinator
patched,https://x.com/rohan_sood15
pearson-labs,https://twitter.com/ycombinator
pelm,https://twitter.com/ycombinator
penguin-ai,https://twitter.com/ycombinator
peoplebox,https://twitter.com/abhichugh
permitflow,https://twitter.com/ycombinator
permitportal,https://twitter.com/rgmazilu
persana-ai,https://www.twitter.com/tweetsreez
pharos,https://x.com/felix_brann
phind,https://twitter.com/michaelroyzen
phonely,https://x.com/phonely_ai
pier,https://twitter.com/ycombinator
pierre,https://twitter.com/fat
pinnacle,https://twitter.com/SeanRoades
pipeshift,https://x.com/FerraoEnrique
pivot,https://twitter.com/raimietang
planbase,https://twitter.com/ycombinator
plover-parametrics,https://twitter.com/ycombinator
plutis,https://twitter.com/kamil_m_ali
poka-labs,https://twitter.com/ycombinator
poly,https://twitter.com/Denizen_Kane
polymath-robotics,https://twitter.com/stefanesa
ponyrun,https://twitter.com/ycombinator
poplarml,https://twitter.com/dnaliu17
posh,https://twitter.com/PoshElectric
power-to-the-brand,https://twitter.com/ycombinator
primevault,https://twitter.com/prashantupd
prohostai,https://twitter.com/bilguunu
promptloop,https://twitter.com/PeterbMangan
propaya,https://x.com/PropayaOfficial
proper,https://twitter.com/kylemaloney_
proprise,https://twitter.com/kragerDev
protegee,https://x.com/kirthibanothu
pump-co,https://www.twitter.com/spndn07/
pumpkin,https://twitter.com/SamuelCrombie
pure,https://twitter.com/collectpure
pylon-2,https://x.com/marty_kausas
pyq-ai,https://twitter.com/araghuvanshi2
query-vary,https://twitter.com/DJFinetunes
rankai,https://x.com/rankai_ai
rastro,https://twitter.com/baptiste_cumin
reactwise,https://twitter.com/ycombinator
read-bean,https://twitter.com/maggieqzhang
readily,https://twitter.com/ycombinator
redouble-ai,https://twitter.com/pneumaticdill?s=21
refine,https://twitter.com/civanozseyhan
reflex,https://twitter.com/getreflex
reforged-labs,https://twitter.com/ycombinator
relace,https://twitter.com/ycombinator
relate,https://twitter.com/chrischae__
remade,https://x.com/Christos_antono
remy,https://twitter.com/ycombinator
remy-2,https://x.com/remysearch
rentflow,https://twitter.com/ycombinator
requestly,https://twitter.com/sachinjain024
resend,https://x.com/zenorocha
respaid,https://twitter.com/johnbanr
reticular,https://x.com/nithinparsan
retrofix-ai,https://twitter.com/danieldoesdev
revamp,https://twitter.com/getrevamp_ai
revyl,https://x.com/landseerenga
reworkd,https://twitter.com/asimdotshrestha
reworks,https://twitter.com/ycombinator
rift,https://twitter.com/FilipTwarowski
riskangle,https://twitter.com/ycombinator
riskcube,https://x.com/andrei_risk
rivet,https://twitter.com/nicholaskissel
riveter-ai,https://x.com/AGrillz
roame,https://x.com/timtqin
roforco,https://x.com/brain_xiang
rome,https://twitter.com/craigzLiszt
roomplays,https://twitter.com/criyaco
rosebud-biosciences,https://twitter.com/KitchenerWilson
rowboat-labs,https://twitter.com/segmenta
rubber-ducky-labs,https://twitter.com/alexandraj777
ruleset,https://twitter.com/LoganFrederick
ryvn,https://x.com/ryvnai
safetykit,https://twitter.com/ycombinator
sage-ai,https://twitter.com/akhilmurthy20
saldor,https://x.com/notblandjacob
salient,https://twitter.com/ycombinator
schemeflow,https://x.com/browninghere
sculpt,https://twitter.com/ycombinator
seals-ai,https://x.com/luismariogm
seis,https://twitter.com/TrevMcKendrick
sensei,https://twitter.com/ycombinator
sensorsurf,https://twitter.com/noahjepstein
sepal-ai,https://www.twitter.com/katqhu1
serial,https://twitter.com/Serialmfg
serif-health,https://www.twitter.com/mfrobben
serra,https://twitter.com/ycombinator
shasta-health,https://twitter.com/SrinjoyMajumdar
shekel-mobility,https://twitter.com/ShekelMobility
shortbread,https://twitter.com/ShortbreadAI
showandtell,https://twitter.com/ycombinator
sidenote,https://twitter.com/jclin22009
sieve,https://twitter.com/mokshith_v
silkchart,https://twitter.com/afakerele
simple-ai,https://twitter.com/catheryn_li
simplehash,https://twitter.com/Alex_Kilkka
simplex,https://x.com/simplexdata
simplifine,https://x.com/egekduman
sizeless,https://twitter.com/cornelius_einem
skyvern,https://x.com/itssuchintan
slingshot,https://twitter.com/ycombinator
snowpilot,https://x.com/snowpilotai
soff,https://x.com/BernhardHausle1
solum-health,https://twitter.com/ycombinator
sonnet,https://twitter.com/ycombinator
sophys,https://twitter.com/ycombinator
sorcerer,https://x.com/big_veech
soteri-skin,https://twitter.com/SoteriSkin
sphere,https://twitter.com/nrudder_
spine-ai,https://twitter.com/BudhkarAkshay
spongecake,https://twitter.com/ycombinator
spur,https://twitter.com/sneha8sivakumar
sre-ai,https://twitter.com/ycombinator
stably,https://x.com/JinjingLiang
stack-ai,https://twitter.com/bernaceituno
stellar,https://twitter.com/ycombinator
stormy-ai-autonomous-marketing-agent,https://twitter.com/karmedge/
strada,https://twitter.com/AmirProd1
stream,https://twitter.com/ycombinator
structured-labs,https://twitter.com/amruthagujjar
studdy,https://twitter.com/mike_lamma
subscriptionflow,https://twitter.com/KashifSaleemCEO
subsets,https://twitter.com/ycombinator
supercontrast,https://twitter.com/ycombinator
supertone,https://twitter.com/trysupertone
superunit,https://x.com/peter_marler
sweep,https://twitter.com/wwzeng1
syncly,https://x.com/synclyhq
synnax,https://x.com/Emilbon99
syntheticfi,https://x.com/SyntheticFi_SF
t3-chat-prev-ping-gg,https://twitter.com/t3dotgg
tableflow,https://twitter.com/mitchpatin
tai,https://twitter.com/Tragen_ai
tandem-2,https://x.com/Tandemspace
taxgpt,https://twitter.com/ChKashifAli
taylor-ai,https://twitter.com/brian_j_kim
teamout,https://twitter.com/ycombinator
tegon,https://twitter.com/harshithb4h
terminal,https://x.com/withterminal
theneo,https://twitter.com/robakid
theya,https://twitter.com/vikasch
thyme,https://twitter.com/ycombinator
tiny,https://twitter.com/ycombinator
tola,https://twitter.com/alencvisic
trainy,https://twitter.com/TrainyAI
trendex-we-tokenize-talent,https://twitter.com/ycombinator
trueplace,https://twitter.com/ycombinator
truewind,https://twitter.com/AlexLee611
trusty,https://twitter.com/trustyhomes
truva,https://twitter.com/gaurav_aggarwal
tuesday,https://twitter.com/kai_jiabo_feng
twenty,https://twitter.com/twentycrm
twine,https://twitter.com/anandvalavalkar
two-dots,https://twitter.com/HensonOrser1
typa,https://twitter.com/sounhochung
typeless,https://twitter.com/ycombinator
unbound,https://twitter.com/ycombinator
undermind,https://twitter.com/UndermindAI
unison,https://twitter.com/maxim_xyz
unlayer,https://twitter.com/adeelraza
unstatiq,https://twitter.com/NishSingaraju
unusual,https://x.com/willwjack
upfront,https://twitter.com/KnowUpfront
vaero,https://twitter.com/ycombinator
vango-ai,https://twitter.com/vango_ai
variance,https://twitter.com/karinemellata
variant,https://twitter.com/bnj
velos,https://twitter.com/OscarMHBF
velt,https://twitter.com/rakesh_goyal
vendra,https://x.com/vendraHQ
vera-health,https://x.com/_maximall
verata,https://twitter.com/ycombinator
versive,https://twitter.com/getversive
vessel,https://twitter.com/vesselapi
vibe,https://twitter.com/ycombinator
videogen,https://twitter.com/ycombinator
vigilant,https://twitter.com/BenShumaker_
vitalize-care,https://twitter.com/nikhiljdsouza
viva-labs,https://twitter.com/vishal_the_jain
vizly,https://twitter.com/vizlyhq
vly-ai-2,https://x.com/victorxheng
vocode,https://twitter.com/kianhooshmand
void,https://x.com/parel_es
voltic,https://twitter.com/ycombinator
vooma,https://twitter.com/jessebucks
wingback,https://twitter.com/tfriehe_
winter,https://twitter.com/AzianMike
wolfia,https://twitter.com/narenmano
wordware,https://twitter.com/kozerafilip
zenbase-ai,https://twitter.com/CyrusOfEden
zeropath,https://x.com/zeropathAI