31 Commits

Author SHA1 Message Date
Stepan Vladovskiy
0bc55977ac debug(reader.py): always use query_with_stat(info)
All checks were successful
Deploy on push / deploy (push) Successful in 51s
2025-03-27 15:18:08 -03:00
Stepan Vladovskiy
ff3a4debce debug(reader.py): trying to handle main topic ids found
All checks were successful
Deploy on push / deploy (push) Successful in 54s
2025-03-27 14:43:17 -03:00
Stepan Vladovskiy
ae85b32f69 feat(type.graphql): SearchResult with shout id
All checks were successful
Deploy on push / deploy (push) Successful in 51s
2025-03-27 14:06:52 -03:00
Stepan Vladovskiy
34a354e9e3 debug(reader.py): trying to bring back shout id in the query call
All checks were successful
Deploy on push / deploy (push) Successful in 52s
2025-03-27 11:54:56 -03:00
Stepan Vladovskiy
e405fb527b refactor(search.py): moved to a single docs table for embeddings and document storage
All checks were successful
Deploy on push / deploy (push) Successful in 50s
2025-03-25 16:42:44 -03:00
Stepan Vladovskiy
7f36f93d92 feat(search.py): detects both missing documents and null embeddings
All checks were successful
Deploy on push / deploy (push) Successful in 1m32s
2025-03-25 15:18:29 -03:00
Stepan Vladovskiy
f089a32394 debug(search.py): more logs when checking indexing sync
All checks were successful
Deploy on push / deploy (push) Successful in 1m3s
2025-03-25 14:44:05 -03:00
Stepan Vladovskiy
1fd623a660 feat: with index sync endpoint configs
All checks were successful
Deploy on push / deploy (push) Successful in 56s
2025-03-25 13:31:45 -03:00
Stepan Vladovskiy
88012f1b8c debug(server.py): with 4 workers (threads), checking reindexing
All checks were successful
Deploy on push / deploy (push) Successful in 55s
2025-03-25 12:21:59 -03:00
Stepan Vladovskiy
6e284640c0 feat: give a little timeout for resource stabilization
All checks were successful
Deploy on push / deploy (push) Successful in 51s
2025-03-24 21:42:51 -03:00
Stepan Vladovskiy
077cb46482 debug: server.py -> threads 1, search.py -> add 3 reconnect attempts
All checks were successful
Deploy on push / deploy (push) Successful in 49s
2025-03-24 20:16:07 -03:00
Stepan Vladovskiy
60a13a9097 refactor(search.py): moved initialization logic into the search-txtai instance
All checks were successful
Deploy on push / deploy (push) Successful in 55s
2025-03-24 19:47:02 -03:00
Stepan Vladovskiy
316375bf18 debug(search.py): increase batch size for bulk indexing
All checks were successful
Deploy on push / deploy (push) Successful in 1m1s
2025-03-21 17:56:54 -03:00
Stepan Vladovskiy
fb820f67fd debug(search.py): increase batch size for bulk indexing
All checks were successful
Deploy on push / deploy (push) Successful in 53s
2025-03-21 17:48:26 -03:00
Stepan Vladovskiy
f1d9f4e036 feat(search.py): with db reset endpoint
All checks were successful
Deploy on push / deploy (push) Successful in 53s
2025-03-21 17:28:54 -03:00
Stepan Vladovskiy
ebb67eb311 debug: decrease chars in search.py for bulk indexing
All checks were successful
Deploy on push / deploy (push) Successful in 52s
2025-03-21 16:53:00 -03:00
Stepan Vladovskiy
50a8c24ead feat(search.py): documents for bulk indexing are categorized
All checks were successful
Deploy on push / deploy (push) Successful in 55s
2025-03-21 15:40:29 -03:00
Stepan Vladovskiy
eb4b9363ab debug: change log entries; indexing no longer wraps everything in documents
All checks were successful
Deploy on push / deploy (push) Successful in 53s
2025-03-21 14:32:45 -03:00
Stepan Vladovskiy
19c5028a0c debug: Limit max chars for bulk indexing
All checks were successful
Deploy on push / deploy (push) Successful in 53s
2025-03-21 14:18:32 -03:00
Stepan Vladovskiy
57e1e8e6bd debug: more logs in indexing
All checks were successful
Deploy on push / deploy (push) Successful in 53s
2025-03-21 14:10:09 -03:00
Stepan Vladovskiy
385057ffcd debug: with logs in indexing procedure
All checks were successful
Deploy on push / deploy (push) Successful in 54s
2025-03-21 13:45:50 -03:00
Stepan Vladovskiy
90699768ff debug: start index
All checks were successful
Deploy on push / deploy (push) Successful in 55s
2025-03-21 13:30:23 -03:00
Stepan Vladovskiy
ad0ca75aa9 debug: no redis for indexing on the backend side
All checks were successful
Deploy on push / deploy (push) Successful in 1m41s
2025-03-19 14:47:31 -03:00
Stepan Vladovskiy
39242d5e6c debug: add logs in search.py and change input validation ... index ver too
All checks were successful
Deploy on push / deploy (push) Successful in 55s
2025-03-12 14:13:55 -03:00
Stepan Vladovskiy
24cca7f2cb debug: something wrong, one step back with logs
All checks were successful
Deploy on push / deploy (push) Successful in 53s
2025-03-12 13:11:19 -03:00
Stepan Vladovskiy
a9c7ac49d6 feat: with logs >>>
All checks were successful
Deploy on push / deploy (push) Successful in 59s
2025-03-12 13:07:27 -03:00
Stepan Vladovskiy
f249752db5 feat: moved txtai and search procedure into a separate instance
All checks were successful
Deploy on push / deploy (push) Successful in 2m18s
2025-03-12 12:06:09 -03:00
Stepan Vladovskiy
c0b2116da2 feat(db.py): added fetch_all_shouts, to populate the search index
All checks were successful
Deploy on push / deploy (push) Successful in 35s
2025-03-05 20:32:34 +00:00
Stepan Vladovskiy
59e71c8144 debug: fixed gitea workflows
All checks were successful
Deploy on push / deploy (push) Successful in 4m41s
2025-03-05 20:17:34 +00:00
Stepan Vladovskiy
e6a416383d debug: fixed gitea workflows
All checks were successful
Deploy on push / deploy (push) Successful in 15s
2025-03-05 20:16:32 +00:00
Stepan Vladovskiy
d55448398d feat(search.py): change to txtai server with AI model; fix granian workers 2025-03-05 20:08:21 +00:00
9 changed files with 636 additions and 204 deletions

.gitea/workflows (deploy workflow)

@@ -29,7 +29,16 @@ jobs:
       if: github.ref == 'refs/heads/dev'
       uses: dokku/github-action@master
       with:
-        branch: 'dev'
+        branch: 'main'
         force: true
         git_remote_url: 'ssh://dokku@v2.discours.io:22/core'
         ssh_private_key: ${{ secrets.SSH_PRIVATE_KEY }}
+    - name: Push to dokku for staging branch
+      if: github.ref == 'refs/heads/staging'
+      uses: dokku/github-action@master
+      with:
+        branch: 'dev'
+        git_remote_url: 'ssh://dokku@staging.discours.io:22/core'
+        ssh_private_key: ${{ secrets.SSH_PRIVATE_KEY }}
+        git_push_flags: '--force'

.gitignore

@@ -162,3 +162,4 @@ views.json
 *.crt
 *cache.json
 .cursor
+.devcontainer/

main.py

@@ -17,7 +17,8 @@ from cache.revalidator import revalidation_manager
 from services.exception import ExceptionHandlerMiddleware
 from services.redis import redis
 from services.schema import create_all_tables, resolvers
-from services.search import search_service
+#from services.search import search_service
+from services.search import search_service, initialize_search_index
 from services.viewed import ViewedStorage
 from services.webhook import WebhookEndpoint, create_webhook_endpoint
 from settings import DEV_SERVER_PID_FILE_NAME, MODE
@@ -34,24 +35,67 @@ async def start():
         f.write(str(os.getpid()))
     print(f"[main] process started in {MODE} mode")
 
+async def check_search_service():
+    """Check if search service is available and log result"""
+    info = await search_service.info()
+    if info.get("status") in ["error", "unavailable"]:
+        print(f"[WARNING] Search service unavailable: {info.get('message', 'unknown reason')}")
+    else:
+        print(f"[INFO] Search service is available: {info}")
+
+# indexing DB data
+# async def indexing():
+#     from services.db import fetch_all_shouts
+#     all_shouts = await fetch_all_shouts()
+#     await initialize_search_index(all_shouts)
 
 async def lifespan(_app):
     try:
+        print("[lifespan] Starting application initialization")
         create_all_tables()
         await asyncio.gather(
             redis.connect(),
             precache_data(),
             ViewedStorage.init(),
             create_webhook_endpoint(),
-            search_service.info(),
+            check_search_service(),
             start(),
             revalidation_manager.start(),
        )
+        print("[lifespan] Basic initialization complete")
+
+        # Add a delay before starting the intensive search indexing
+        print("[lifespan] Waiting for system stabilization before search indexing...")
+        await asyncio.sleep(10)  # 10-second delay to let the system stabilize
+
+        # Start search indexing as a background task with lower priority
+        asyncio.create_task(initialize_search_index_background())
+
         yield
     finally:
+        print("[lifespan] Shutting down application services")
         tasks = [redis.disconnect(), ViewedStorage.stop(), revalidation_manager.stop()]
         await asyncio.gather(*tasks, return_exceptions=True)
+        print("[lifespan] Shutdown complete")
+
+# Initialize search index in the background
+async def initialize_search_index_background():
+    """Run search indexing as a background task with low priority"""
+    try:
+        print("[search] Starting background search indexing process")
+        from services.db import fetch_all_shouts
+
+        # Get total count first (optional)
+        all_shouts = await fetch_all_shouts()
+        total_count = len(all_shouts) if all_shouts else 0
+        print(f"[search] Fetched {total_count} shouts for background indexing")
+
+        # Start the indexing process with the fetched shouts
+        print("[search] Beginning background search index initialization...")
+        await initialize_search_index(all_shouts)
+        print("[search] Background search index initialization complete")
+    except Exception as e:
+        print(f"[search] Error in background search indexing: {str(e)}")
 
 # Create the GraphQL instance
 graphql_app = GraphQL(schema, debug=True)
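
Note: the lifespan change above defers the expensive indexing step instead of blocking startup. A minimal, self-contained sketch of the same fire-and-forget pattern (names and delays are illustrative, not from the repo):

import asyncio
import contextlib

async def heavy_task():
    # Stand-in for initialize_search_index_background()
    await asyncio.sleep(0.5)
    print("[search] background indexing done")

async def main():
    print("[lifespan] basic initialization complete")
    await asyncio.sleep(1)  # stabilization delay (10 seconds in main.py above)
    task = asyncio.create_task(heavy_task())  # runs concurrently with request handling
    await asyncio.sleep(2)  # the app would serve requests here
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task

asyncio.run(main())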

requirements.txt

@@ -17,6 +17,9 @@ gql
 ariadne
 granian
 
+# NLP and search
+httpx
+
 pydantic
 fakeredis
 pytest
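
Note: httpx is added here for the async txtai client used in services/search.py below. A minimal sketch of that client pattern (the base URL and endpoint are placeholders, assuming a txtai-style HTTP service):

import asyncio
import httpx

async def main():
    # One client per timeout profile, as SearchService does below
    async with httpx.AsyncClient(base_url="http://localhost:8000", timeout=30.0) as client:
        response = await client.post("/search", json={"text": "hello", "limit": 10, "offset": 0})
        response.raise_for_status()
        print(response.json())

asyncio.run(main())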

reader.py

@@ -253,10 +253,10 @@ def get_shouts_with_links(info, q, limit=20, offset=0):
                 "is_main": True,
             }
         elif not main_topic:
-            logger.warning(f"No main_topic and no topics found for shout#{shout_id}")
+            logger.debug(f"No main_topic and no topics found for shout#{shout_id}")
             main_topic = {"id": 0, "title": "no topic", "slug": "notopic", "is_main": True}
         shout_dict["main_topic"] = main_topic
-        # logger.debug(f"Final main_topic for shout#{shout_id}: {main_topic}")
+        logger.debug(f"Final main_topic for shout#{shout_id}: {main_topic}")
 
         if has_field(info, "authors") and hasattr(row, "authors"):
             shout_dict["authors"] = (
@@ -413,18 +413,26 @@ async def load_shouts_search(_, info, text, options):
             scores[shout_id] = sr.get("score")
             hits_ids.append(shout_id)
 
-        q = (
-            query_with_stat(info)
-            if has_field(info, "stat")
-            else select(Shout).filter(and_(Shout.published_at.is_not(None), Shout.deleted_at.is_(None)))
-        )
+        """ q = (
+            query_with_stat(info)
+            if has_field(info, "stat")
+            else select(Shout).filter(and_(Shout.published_at.is_not(None), Shout.deleted_at.is_(None)))
+        ) """
+        q = query_with_stat(info)
+
         q = q.filter(Shout.id.in_(hits_ids))
         q = apply_filters(q, options)
+        q = apply_sorting(q, options)
+
+        # added this to join topics
+        topic_join = aliased(ShoutTopic)
+        topic = aliased(Topic)
+        q = q.outerjoin(topic_join, topic_join.shout == Shout.id)
+        q = q.outerjoin(topic, topic.id == topic_join.topic)
+
         shouts = get_shouts_with_links(info, q, limit, offset)
         for shout in shouts:
-            shout.score = scores[f"{shout.id}"]
-        shouts.sort(key=lambda x: x.score, reverse=True)
+            shout["score"] = scores[f"{shout['id']}"]
+        shouts.sort(key=lambda x: x["score"], reverse=True)
         return shouts
     return []
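
Note: the score-handling change above assumes get_shouts_with_links now returns plain dicts rather than ORM rows, so scores are attached by key instead of attribute. A tiny self-contained sketch of that re-ranking step, with made-up data:

# Search hits keyed by string id, as produced by load_shouts_search
scores = {"1": 0.92, "2": 0.75}
shouts = [{"id": 2, "title": "b"}, {"id": 1, "title": "a"}]

for shout in shouts:
    # Score keys are strings while ids are ints, hence the f-string lookup
    shout["score"] = scores[f"{shout['id']}"]

shouts.sort(key=lambda x: x["score"], reverse=True)
print([s["id"] for s in shouts])  # [1, 2]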

type.graphql

@@ -207,6 +207,7 @@ type CommonResult {
 }
 
 type SearchResult {
+    id: Int!
     slug: String!
     title: String!
     cover: String

server.py

@@ -3,7 +3,7 @@ from pathlib import Path
 from granian.constants import Interfaces
 from granian.log import LogLevels
-from granian.server import Granian
+from granian.server import Server
 
 from settings import PORT
 from utils.logger import root_logger as logger
@@ -11,12 +11,13 @@ from utils.logger import root_logger as logger
 if __name__ == "__main__":
     logger.info("started")
     try:
-        granian_instance = Granian(
+        granian_instance = Server(
             "main:app",
             address="0.0.0.0",
             port=PORT,
             interface=Interfaces.ASGI,
-            threads=4,
+            workers=1,
             websockets=False,
             log_level=LogLevels.debug,
             backlog=2048,

services/db.py

@@ -181,3 +181,27 @@ def get_json_builder():
 
 # Use them in the code
 json_builder, json_array_builder, json_cast = get_json_builder()
+
+async def fetch_all_shouts(session=None):
+    """Fetch all published shouts for search indexing"""
+    from orm.shout import Shout
+
+    close_session = False
+    if session is None:
+        session = local_session()
+        close_session = True
+
+    try:
+        # Fetch only published and non-deleted shouts
+        query = session.query(Shout).filter(
+            Shout.published_at.is_not(None),
+            Shout.deleted_at.is_(None)
+        )
+        shouts = query.all()
+        return shouts
+    except Exception as e:
+        logger.error(f"Error fetching shouts for search indexing: {e}")
+        return []
+    finally:
+        if close_session:
+            session.close()
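
Note: fetch_all_shouts opens its own session only when the caller does not pass one, and closes only what it opened. A generic, runnable sketch of that ownership rule (FakeSession and open_session are hypothetical stand-ins for local_session()):

class FakeSession:
    def query_all(self):
        return ["shout-1", "shout-2"]

    def close(self):
        print("session closed")

def open_session():
    return FakeSession()

def fetch_items(session=None):
    # Close only the session we opened ourselves
    close_session = False
    if session is None:
        session = open_session()
        close_session = True
    try:
        return session.query_all()
    finally:
        if close_session:
            session.close()

print(fetch_items())        # opens and closes its own session
shared = open_session()
print(fetch_items(shared))  # caller keeps ownership; not closed here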

services/search.py

@@ -2,230 +2,571 @@ import asyncio
 import json
 import logging
 import os
+import httpx
+import time
+import random
 
-from opensearchpy import OpenSearch
-from services.redis import redis
-from utils.encoders import CustomJSONEncoder
-
-# Set redis logging level to suppress DEBUG messages
+# Set up proper logging
 logger = logging.getLogger("search")
-logger.setLevel(logging.WARNING)
+logger.setLevel(logging.INFO)  # Change to INFO to see more details
 
-ELASTIC_HOST = os.environ.get("ELASTIC_HOST", "").replace("https://", "")
-ELASTIC_USER = os.environ.get("ELASTIC_USER", "")
-ELASTIC_PASSWORD = os.environ.get("ELASTIC_PASSWORD", "")
-ELASTIC_PORT = os.environ.get("ELASTIC_PORT", 9200)
-ELASTIC_URL = os.environ.get(
-    "ELASTIC_URL",
-    f"https://{ELASTIC_USER}:{ELASTIC_PASSWORD}@{ELASTIC_HOST}:{ELASTIC_PORT}",
-)
-REDIS_TTL = 86400  # 1 day in seconds
+# Configuration for search service
+SEARCH_ENABLED = bool(os.environ.get("SEARCH_ENABLED", "true").lower() in ["true", "1", "yes"])
+TXTAI_SERVICE_URL = os.environ.get("TXTAI_SERVICE_URL", "none")
+MAX_BATCH_SIZE = int(os.environ.get("SEARCH_MAX_BATCH_SIZE", "25"))
 
-index_settings = {
-    "settings": {
-        "index": {"number_of_shards": 1, "auto_expand_replicas": "0-all"},
-        "analysis": {
-            "analyzer": {
-                "ru": {
-                    "tokenizer": "standard",
-                    "filter": ["lowercase", "ru_stop", "ru_stemmer"],
-                }
-            },
-            "filter": {
-                "ru_stemmer": {"type": "stemmer", "language": "russian"},
-                "ru_stop": {"type": "stop", "stopwords": "_russian_"},
-            },
-        },
-    },
-    "mappings": {
-        "properties": {
-            "body": {"type": "text", "analyzer": "ru"},
-            "title": {"type": "text", "analyzer": "ru"},
-            "subtitle": {"type": "text", "analyzer": "ru"},
-            "lead": {"type": "text", "analyzer": "ru"},
-            "media": {"type": "text", "analyzer": "ru"},
-        }
-    },
-}
-
-expected_mapping = index_settings["mappings"]
-
-# Create the event loop
-search_loop = asyncio.get_event_loop()
-
-# Add a flag at the top of the file
-SEARCH_ENABLED = bool(os.environ.get("ELASTIC_HOST", ""))
-
-def get_indices_stats():
-    indices_stats = search_service.client.cat.indices(format="json")
-    for index_info in indices_stats:
-        index_name = index_info["index"]
-        if not index_name.startswith("."):
-            index_health = index_info["health"]
-            index_status = index_info["status"]
-            pri_shards = index_info["pri"]
-            rep_shards = index_info["rep"]
-            docs_count = index_info["docs.count"]
-            docs_deleted = index_info["docs.deleted"]
-            store_size = index_info["store.size"]
-            pri_store_size = index_info["pri.store.size"]
-            logger.info(f"Index: {index_name}")
-            logger.info(f"Health: {index_health}")
-            logger.info(f"Status: {index_status}")
-            logger.info(f"Primary Shards: {pri_shards}")
-            logger.info(f"Replica Shards: {rep_shards}")
-            logger.info(f"Documents Count: {docs_count}")
-            logger.info(f"Deleted Documents: {docs_deleted}")
-            logger.info(f"Store Size: {store_size}")
-            logger.info(f"Primary Store Size: {pri_store_size}")
 class SearchService:
-    def __init__(self, index_name="search_index"):
-        logger.info("Initializing search...")
-        self.index_name = index_name
-        self.client = None
-        self.lock = asyncio.Lock()
-
-        # Initialize the OpenSearch client only if search is enabled
-        if SEARCH_ENABLED:
-            try:
-                self.client = OpenSearch(
-                    hosts=[{"host": ELASTIC_HOST, "port": ELASTIC_PORT}],
-                    http_compress=True,
-                    http_auth=(ELASTIC_USER, ELASTIC_PASSWORD),
-                    use_ssl=True,
-                    verify_certs=False,
-                    ssl_assert_hostname=False,
-                    ssl_show_warn=False,
-                )
-                logger.info("OpenSearch client connected")
-                search_loop.create_task(self.check_index())
-            except Exception as exc:
-                logger.warning(f"Search disabled due to connection error: {exc}")
-                self.client = None
-        else:
-            logger.info("Search disabled (ELASTIC_HOST is not set)")
+    def __init__(self):
+        logger.info(f"Initializing search service with URL: {TXTAI_SERVICE_URL}")
+        self.available = SEARCH_ENABLED
+        # Use different timeout settings for indexing and search requests
+        self.client = httpx.AsyncClient(timeout=30.0, base_url=TXTAI_SERVICE_URL)
+        self.index_client = httpx.AsyncClient(timeout=120.0, base_url=TXTAI_SERVICE_URL)
+
+        if not self.available:
+            logger.info("Search disabled (SEARCH_ENABLED = False)")
 
     async def info(self):
-        if not SEARCH_ENABLED:
+        """Return information about search service"""
+        if not self.available:
             return {"status": "disabled"}
         try:
-            return get_indices_stats()
+            response = await self.client.get("/info")
+            response.raise_for_status()
+            result = response.json()
+            logger.info(f"Search service info: {result}")
+            return result
         except Exception as e:
             logger.error(f"Failed to get search info: {e}")
             return {"status": "error", "message": str(e)}
 
-    def delete_index(self):
-        if self.client:
-            logger.warning(f"[!!!] Deleting index {self.index_name}")
-            self.client.indices.delete(index=self.index_name, ignore_unavailable=True)
-
-    def create_index(self):
-        if self.client:
-            logger.info(f"Creating index: {self.index_name}")
-            self.client.indices.create(index=self.index_name, body=index_settings)
-            logger.info(f"Index {self.index_name} created")
-
-    async def check_index(self):
-        if self.client:
-            logger.info(f"Checking index {self.index_name}...")
-            if not self.client.indices.exists(index=self.index_name):
-                self.create_index()
-                self.client.indices.put_mapping(index=self.index_name, body=expected_mapping)
-            else:
-                logger.info(f"Found existing index {self.index_name}")
-                # Check and update the index structure if necessary
-                result = self.client.indices.get_mapping(index=self.index_name)
-                if isinstance(result, str):
-                    result = json.loads(result)
-                if isinstance(result, dict):
-                    mapping = result.get(self.index_name, {}).get("mappings")
-                    logger.info(f"Found index mapping: {mapping['properties'].keys()}")
-                    expected_keys = expected_mapping["properties"].keys()
-                    if mapping and mapping["properties"].keys() != expected_keys:
-                        logger.info(f"Expected index mapping: {expected_mapping}")
-                        logger.warning("[!!!] Reindexing of all data is required")
-                        self.delete_index()
-                        self.client = None
-        else:
-            logger.error("Client is not initialized, cannot check the index")
+    def is_ready(self):
+        """Check if service is available"""
+        return self.available
+
+    async def verify_docs(self, doc_ids):
+        """Verify which documents exist in the search index"""
+        if not self.available:
+            return {"status": "disabled"}
+
+        try:
+            logger.info(f"Verifying {len(doc_ids)} documents in search index")
+            response = await self.client.post(
+                "/verify-docs",
+                json={"doc_ids": doc_ids},
+                timeout=60.0  # Longer timeout for potentially large ID lists
+            )
+            response.raise_for_status()
+            result = response.json()
+
+            # Log summary of verification results
+            missing_count = len(result.get("missing", []))
+            logger.info(f"Document verification complete: {missing_count} missing out of {len(doc_ids)} total")
+
+            return result
+        except Exception as e:
+            logger.error(f"Document verification error: {e}")
+            return {"status": "error", "message": str(e)}
     def index(self, shout):
-        if not SEARCH_ENABLED:
+        """Index a single document"""
+        if not self.available:
             return
-        if self.client:
-            logger.info(f"Indexing post {shout.id}")
-            index_body = {
-                "body": shout.body,
-                "title": shout.title,
-                "subtitle": shout.subtitle,
-                "lead": shout.lead,
-                "media": shout.media,
-            }
-            asyncio.create_task(self.perform_index(shout, index_body))
+        logger.info(f"Indexing post {shout.id}")
+        # Start in background to not block
+        asyncio.create_task(self.perform_index(shout))
 
-    async def perform_index(self, shout, index_body):
-        if self.client:
-            try:
-                await asyncio.wait_for(
-                    self.client.index(index=self.index_name, id=str(shout.id), body=index_body), timeout=40.0
-                )
-            except asyncio.TimeoutError:
-                logger.error(f"Indexing timeout for shout {shout.id}")
-            except Exception as e:
-                logger.error(f"Indexing error for shout {shout.id}: {e}")
+    async def perform_index(self, shout):
+        """Actually perform the indexing operation"""
+        if not self.available:
+            return
+
+        try:
+            # Combine all text fields
+            text = " ".join(filter(None, [
+                shout.title or "",
+                shout.subtitle or "",
+                shout.lead or "",
+                shout.body or "",
+                shout.media or ""
+            ]))
+
+            if not text.strip():
+                logger.warning(f"No text content to index for shout {shout.id}")
+                return
+
+            logger.info(f"Indexing document: ID={shout.id}, Text length={len(text)}")
+
+            # Send to txtai service
+            response = await self.client.post(
+                "/index",
+                json={"id": str(shout.id), "text": text}
+            )
+            response.raise_for_status()
+            result = response.json()
+            logger.info(f"Post {shout.id} successfully indexed: {result}")
+        except Exception as e:
+            logger.error(f"Indexing error for shout {shout.id}: {e}")
+    async def bulk_index(self, shouts):
+        """Index multiple documents at once with adaptive batch sizing"""
+        if not self.available or not shouts:
+            logger.warning(f"Bulk indexing skipped: available={self.available}, shouts_count={len(shouts) if shouts else 0}")
+            return
+
+        start_time = time.time()
+        logger.info(f"Starting bulk indexing of {len(shouts)} documents")
+
+        MAX_TEXT_LENGTH = 4000  # Maximum text length to send in a single request
+        max_batch_size = MAX_BATCH_SIZE
+        total_indexed = 0
+        total_skipped = 0
+        total_truncated = 0
+        total_retries = 0
+
+        # Group documents by size to process smaller documents in larger batches
+        small_docs = []
+        medium_docs = []
+        large_docs = []
+
+        # First pass: prepare all documents and categorize by size
+        for shout in shouts:
+            try:
+                text_fields = []
+                for field_name in ['title', 'subtitle', 'lead', 'body']:
+                    field_value = getattr(shout, field_name, None)
+                    if field_value and isinstance(field_value, str) and field_value.strip():
+                        text_fields.append(field_value.strip())
+
+                # Media field processing remains the same
+                media = getattr(shout, 'media', None)
+                if media:
+                    # Your existing media processing logic
+                    if isinstance(media, str):
+                        try:
+                            media_json = json.loads(media)
+                            if isinstance(media_json, dict):
+                                if 'title' in media_json:
+                                    text_fields.append(media_json['title'])
+                                if 'body' in media_json:
+                                    text_fields.append(media_json['body'])
+                        except json.JSONDecodeError:
+                            text_fields.append(media)
+                    elif isinstance(media, dict):
+                        if 'title' in media:
+                            text_fields.append(media['title'])
+                        if 'body' in media:
+                            text_fields.append(media['body'])
+
+                text = " ".join(text_fields)
+
+                if not text.strip():
+                    logger.debug(f"Skipping shout {shout.id}: no text content")
+                    total_skipped += 1
+                    continue
+
+                # Truncate text if it exceeds the maximum length
+                original_length = len(text)
+                if original_length > MAX_TEXT_LENGTH:
+                    text = text[:MAX_TEXT_LENGTH]
+                    logger.info(f"Truncated document {shout.id} from {original_length} to {MAX_TEXT_LENGTH} chars")
+                    total_truncated += 1
+
+                document = {
+                    "id": str(shout.id),
+                    "text": text
+                }
+
+                # Categorize by size
+                text_len = len(text)
+                if text_len > 5000:
+                    large_docs.append(document)
+                elif text_len > 2000:
+                    medium_docs.append(document)
+                else:
+                    small_docs.append(document)
+
+                total_indexed += 1
+            except Exception as e:
+                logger.error(f"Error processing shout {getattr(shout, 'id', 'unknown')} for indexing: {e}")
+                total_skipped += 1
+
+        # Process each category with appropriate batch sizes
+        logger.info(f"Documents categorized: {len(small_docs)} small, {len(medium_docs)} medium, {len(large_docs)} large")
+
+        # Process small documents (larger batches)
+        if small_docs:
+            batch_size = min(max_batch_size, 15)
+            await self._process_document_batches(small_docs, batch_size, "small")
+
+        # Process medium documents (medium batches)
+        if medium_docs:
+            batch_size = min(max_batch_size, 10)
+            await self._process_document_batches(medium_docs, batch_size, "medium")
+
+        # Process large documents (small batches)
+        if large_docs:
+            batch_size = min(max_batch_size, 3)
+            await self._process_document_batches(large_docs, batch_size, "large")
+
+        elapsed = time.time() - start_time
+        logger.info(f"Bulk indexing completed in {elapsed:.2f}s: {total_indexed} indexed, {total_skipped} skipped, {total_truncated} truncated, {total_retries} retries")
+    async def _process_document_batches(self, documents, batch_size, size_category):
+        """Process document batches with retry logic"""
+        # Check for possible database corruption before starting
+        db_error_count = 0
+
+        for i in range(0, len(documents), batch_size):
+            batch = documents[i:i+batch_size]
+            batch_id = f"{size_category}-{i//batch_size + 1}"
+            logger.info(f"Processing {size_category} batch {batch_id} of {len(batch)} documents")
+
+            retry_count = 0
+            max_retries = 3
+            success = False
+
+            # Process with retries
+            while not success and retry_count < max_retries:
+                try:
+                    if batch:
+                        sample = batch[0]
+                        logger.info(f"Sample document in batch {batch_id}: id={sample['id']}, text_length={len(sample['text'])}")
+
+                    logger.info(f"Sending batch {batch_id} of {len(batch)} documents to search service (attempt {retry_count+1})")
+                    response = await self.index_client.post(
+                        "/bulk-index",
+                        json=batch,
+                        timeout=120.0  # Explicit longer timeout for large batches
+                    )
+
+                    # Handle 422 validation errors - these won't be fixed by retrying
+                    if response.status_code == 422:
+                        error_detail = response.json()
+                        truncated_error = self._truncate_error_detail(error_detail)
+                        logger.error(f"Validation error from search service for batch {batch_id}: {truncated_error}")
+                        break
+
+                    # Handle 500 server errors - these might be fixed by retrying with smaller batches
+                    elif response.status_code == 500:
+                        db_error_count += 1
+
+                        # If we've seen multiple 500s, log a critical error
+                        if db_error_count >= 3:
+                            logger.critical(f"Multiple server errors detected (500). The search service may need manual intervention. Stopping batch {batch_id} processing.")
+                            break
+
+                        # Try again with exponential backoff
+                        if retry_count < max_retries - 1:
+                            retry_count += 1
+                            wait_time = (2 ** retry_count) + (random.random() * 0.5)  # Exponential backoff with jitter
+                            logger.warning(f"Server error for batch {batch_id}, retrying in {wait_time:.1f}s (attempt {retry_count+1}/{max_retries})")
+                            await asyncio.sleep(wait_time)
+                            continue
+
+                        # Final retry, split the batch
+                        elif len(batch) > 1:
+                            logger.warning(f"Splitting batch {batch_id} after repeated failures")
+                            mid = len(batch) // 2
+                            await self._process_single_batch(batch[:mid], f"{batch_id}-A")
+                            await self._process_single_batch(batch[mid:], f"{batch_id}-B")
+                            break
+                        else:
+                            # Can't split a single document
+                            logger.error(f"Failed to index document {batch[0]['id']} after {max_retries} attempts")
+                            break
+
+                    # Normal success case
+                    response.raise_for_status()
+                    result = response.json()
+                    logger.info(f"Batch {batch_id} indexed successfully: {result}")
+                    success = True
+                    db_error_count = 0  # Reset error counter on success
+
+                except Exception as e:
+                    # Check if it looks like a database corruption error
+                    error_str = str(e).lower()
+                    if "duplicate key" in error_str or "unique constraint" in error_str or "nonetype" in error_str:
+                        db_error_count += 1
+                        if db_error_count >= 2:
+                            logger.critical(f"Potential database corruption detected: {error_str}. The search service may need manual intervention. Stopping batch {batch_id} processing.")
+                            break
+
+                    if retry_count < max_retries - 1:
+                        retry_count += 1
+                        wait_time = (2 ** retry_count) + (random.random() * 0.5)
+                        logger.warning(f"Error for batch {batch_id}, retrying in {wait_time:.1f}s: {str(e)[:200]}")
+                        await asyncio.sleep(wait_time)
+                    else:
+                        # Last resort - try to split the batch
+                        if len(batch) > 1:
+                            logger.warning(f"Splitting batch {batch_id} after exception: {str(e)[:200]}")
+                            mid = len(batch) // 2
+                            await self._process_single_batch(batch[:mid], f"{batch_id}-A")
+                            await self._process_single_batch(batch[mid:], f"{batch_id}-B")
+                        else:
+                            logger.error(f"Failed to index document {batch[0]['id']} after {max_retries} attempts: {e}")
+                        break
+    async def _process_single_batch(self, documents, batch_id):
+        """Process a single batch with maximum reliability"""
+        max_retries = 3
+        retry_count = 0
+
+        while retry_count < max_retries:
+            try:
+                if not documents:
+                    return
+
+                logger.info(f"Processing sub-batch {batch_id} with {len(documents)} documents")
+                response = await self.index_client.post(
+                    "/bulk-index",
+                    json=documents,
+                    timeout=90.0
+                )
+                response.raise_for_status()
+                result = response.json()
+                logger.info(f"Sub-batch {batch_id} indexed successfully: {result}")
+                return  # Success, exit the retry loop
+            except Exception as e:
+                error_str = str(e).lower()
+                retry_count += 1
+
+                # Check if it's a transient error that txtai might recover from internally
+                if "dictionary changed size" in error_str or "transaction error" in error_str:
+                    wait_time = (2 ** retry_count) + (random.random() * 0.5)
+                    logger.warning(f"Transient txtai error in sub-batch {batch_id}, waiting {wait_time:.1f}s for recovery: {str(e)[:200]}")
+                    await asyncio.sleep(wait_time)  # Wait for txtai to recover
+                    continue  # Try again
+
+                # For other errors or final retry failure
+                logger.error(f"Error indexing sub-batch {batch_id} (attempt {retry_count}/{max_retries}): {str(e)[:200]}")
+
+                # Only try one-by-one on the final retry
+                if retry_count >= max_retries and len(documents) > 1:
+                    logger.info(f"Processing documents in sub-batch {batch_id} individually")
+                    for i, doc in enumerate(documents):
+                        try:
+                            resp = await self.index_client.post("/index", json=doc, timeout=30.0)
+                            resp.raise_for_status()
+                            logger.info(f"Indexed document {doc['id']} individually")
+                        except Exception as e2:
+                            logger.error(f"Failed to index document {doc['id']} individually: {str(e2)[:100]}")
+                    return  # Exit after individual processing attempt
+    def _truncate_error_detail(self, error_detail):
+        """Truncate error details for logging"""
+        truncated_detail = error_detail.copy() if isinstance(error_detail, dict) else error_detail
+
+        if isinstance(truncated_detail, dict) and 'detail' in truncated_detail and isinstance(truncated_detail['detail'], list):
+            for i, item in enumerate(truncated_detail['detail']):
+                if isinstance(item, dict) and 'input' in item:
+                    if isinstance(item['input'], dict) and any(k in item['input'] for k in ['documents', 'text']):
+                        # Check for documents list
+                        if 'documents' in item['input'] and isinstance(item['input']['documents'], list):
+                            for j, doc in enumerate(item['input']['documents']):
+                                if 'text' in doc and isinstance(doc['text'], str) and len(doc['text']) > 100:
+                                    item['input']['documents'][j]['text'] = f"{doc['text'][:100]}... [truncated, total {len(doc['text'])} chars]"
+
+                        # Check for direct text field
+                        if 'text' in item['input'] and isinstance(item['input']['text'], str) and len(item['input']['text']) > 100:
+                            item['input']['text'] = f"{item['input']['text'][:100]}... [truncated, total {len(item['input']['text'])} chars]"
+
+        return truncated_detail
     async def search(self, text, limit, offset):
-        if not SEARCH_ENABLED:
-            return []
-        logger.info(f"Searching for: {text} {offset}+{limit}")
-        search_body = {
-            "query": {"multi_match": {"query": text, "fields": ["title", "lead", "subtitle", "body", "media"]}}
-        }
-        if self.client:
-            search_response = self.client.search(
-                index=self.index_name,
-                body=search_body,
-                size=limit,
-                from_=offset,
-                _source=False,
-                _source_excludes=["title", "body", "subtitle", "media", "lead", "_index"],
-            )
-            hits = search_response["hits"]["hits"]
-            results = [{"id": hit["_id"], "score": hit["_score"]} for hit in hits]
-
-            # If the results are not empty
-            if results:
-                # Cache in Redis with TTL
-                redis_key = f"search:{text}:{offset}+{limit}"
-                await redis.execute(
-                    "SETEX",
-                    redis_key,
-                    REDIS_TTL,
-                    json.dumps(results, cls=CustomJSONEncoder),
-                )
-            return results
-        return []
+        """Search documents"""
+        if not self.available:
+            logger.warning("Search not available")
+            return []
+
+        if not isinstance(text, str) or not text.strip():
+            logger.warning(f"Invalid search text: {text}")
+            return []
+
+        logger.info(f"Searching for: '{text}' (limit={limit}, offset={offset})")
+
+        try:
+            logger.info(f"Sending search request: text='{text}', limit={limit}, offset={offset}")
+            response = await self.client.post(
+                "/search",
+                json={"text": text, "limit": limit, "offset": offset}
+            )
+            response.raise_for_status()
+
+            logger.info(f"Raw search response: {response.text}")
+            result = response.json()
+            logger.info(f"Parsed search response: {result}")
+
+            formatted_results = result.get("results", [])
+            logger.info(f"Search for '{text}' returned {len(formatted_results)} results")
+
+            if formatted_results:
+                logger.info(f"Sample result: {formatted_results[0]}")
+            else:
+                logger.warning(f"No results found for '{text}'")
+
+            return formatted_results
+        except Exception as e:
+            logger.error(f"Search error for '{text}': {e}", exc_info=True)
+            return []
+
+    async def check_index_status(self):
+        """Get detailed statistics about the search index health"""
+        if not self.available:
+            return {"status": "disabled"}
+
+        try:
+            response = await self.client.get("/index-status")
+            response.raise_for_status()
+            result = response.json()
+            logger.info(f"Index status check: {result['status']}, {result['documents_count']} documents")
+
+            # Log warnings for any inconsistencies
+            if result.get("consistency", {}).get("status") != "ok":
+                null_count = result.get("consistency", {}).get("null_embeddings_count", 0)
+                if null_count > 0:
+                    logger.warning(f"Found {null_count} documents with NULL embeddings")
+
+            return result
+        except Exception as e:
+            logger.error(f"Failed to check index status: {e}")
+            return {"status": "error", "message": str(e)}
+# Create the search service singleton
 search_service = SearchService()
 
+# API-compatible function to perform a search
 async def search_text(text: str, limit: int = 50, offset: int = 0):
     payload = []
-    if search_service.client:
-        # Use the search_post method of OpenSearchService
+    if search_service.available:
         payload = await search_service.search(text, limit, offset)
     return payload
-# Check that the URL is correct
-OPENSEARCH_URL = os.getenv("OPENSEARCH_URL", "rc1a-3n5pi3bhuj9gieel.mdb.yandexcloud.net")
+async def initialize_search_index(shouts_data):
+    """Initialize search index with existing data during application startup"""
+    if not SEARCH_ENABLED:
+        logger.info("Search indexing skipped (SEARCH_ENABLED=False)")
+        return
+
+    if not shouts_data:
+        logger.warning("No shouts data provided for search indexing")
+        return
+
+    logger.info(f"Checking search index status for {len(shouts_data)} documents")
+
+    # Get the current index info
+    info = await search_service.info()
+    if info.get("status") in ["error", "unavailable", "disabled"]:
+        logger.error(f"Cannot initialize search index: {info}")
+        return
+
+    # Check if index has approximately the right number of documents
+    index_stats = info.get("index_stats", {})
+    indexed_doc_count = index_stats.get("document_count", 0)
+
+    # Add a more detailed status check
+    index_status = await search_service.check_index_status()
+    if index_status.get("status") == "healthy":
+        logger.info("Index status check passed")
+    elif index_status.get("status") == "inconsistent":
+        logger.warning("Index status check found inconsistencies")
+
+        # Get documents with null embeddings
+        problem_ids = index_status.get("consistency", {}).get("null_embeddings_sample", [])
+
+        if problem_ids:
+            logger.info(f"Repairing {len(problem_ids)} documents with NULL embeddings")
+            problem_docs = [shout for shout in shouts_data if str(shout.id) in problem_ids]
+            if problem_docs:
+                await search_service.bulk_index(problem_docs)
+
+    # Log database document summary
+    db_ids = [str(shout.id) for shout in shouts_data]
+    logger.info(f"Database contains {len(shouts_data)} documents. Sample IDs: {', '.join(db_ids[:5])}...")
+
+    # Calculate summary by ID range to understand the coverage
+    try:
+        # Parse numeric IDs where possible to analyze coverage
+        numeric_ids = [int(sid) for sid in db_ids if sid.isdigit()]
+        if numeric_ids:
+            min_id = min(numeric_ids)
+            max_id = max(numeric_ids)
+            id_range = max_id - min_id + 1
+            coverage_pct = (len(numeric_ids) / id_range) * 100 if id_range > 0 else 0
+            logger.info(f"ID range analysis: min_id={min_id}, max_id={max_id}, range={id_range}, "
+                        f"coverage={coverage_pct:.1f}% ({len(numeric_ids)}/{id_range})")
+    except Exception as e:
+        logger.warning(f"Could not analyze ID ranges: {e}")
+
+    # If counts are significantly different, do verification
+    if abs(indexed_doc_count - len(shouts_data)) > 10:
+        logger.info(f"Document count mismatch: {indexed_doc_count} in index vs {len(shouts_data)} in database. Verifying...")
+
+        # Get all document IDs from your database
+        doc_ids = [str(shout.id) for shout in shouts_data]
+
+        # Verify which ones are missing from the index
+        verification = await search_service.verify_docs(doc_ids)
+
+        if verification.get("status") == "error":
+            logger.error(f"Document verification failed: {verification.get('message')}")
+            return
+
+        # Index only missing documents
+        missing_ids = verification.get("missing", [])
+        if missing_ids:
+            logger.info(f"Found {len(missing_ids)} documents missing from index. Indexing them...")
+            logger.info(f"Sample missing IDs: {', '.join(missing_ids[:10])}...")
+            missing_docs = [shout for shout in shouts_data if str(shout.id) in missing_ids]
+            await search_service.bulk_index(missing_docs)
+        else:
+            logger.info("All documents are already indexed.")
+    else:
+        logger.info(f"Search index appears to be in sync ({indexed_doc_count} documents indexed).")
+
+    # Optional sample verification (can be slow with large document sets)
+    # Uncomment if you want to periodically check a random sample even when counts match
+    """
+    sample_size = 10
+    if len(db_ids) > sample_size:
+        sample_ids = random.sample(db_ids, sample_size)
+        logger.info(f"Performing random sample verification on {sample_size} documents...")
+        verification = await search_service.verify_docs(sample_ids)
+        if verification.get("missing"):
+            missing_count = len(verification.get("missing", []))
+            logger.warning(f"Random verification found {missing_count}/{sample_size} missing docs "
+                           f"despite count match. Consider full verification.")
+        else:
+            logger.info("Random document sample verification passed.")
+    """
+
+    # Verify with test query
+    try:
+        test_query = "test"
+        logger.info(f"Verifying search index with query: '{test_query}'")
+        test_results = await search_text(test_query, 5)
+
+        if test_results:
+            logger.info(f"Search verification successful: found {len(test_results)} results")
+            # Log categories covered by search results
+            categories = set()
+            for result in test_results:
+                result_id = result.get("id")
+                matching_shouts = [s for s in shouts_data if str(s.id) == result_id]
+                if matching_shouts and hasattr(matching_shouts[0], 'category'):
+                    categories.add(getattr(matching_shouts[0], 'category', 'unknown'))
+            if categories:
+                logger.info(f"Search results cover categories: {', '.join(categories)}")
+        else:
+            logger.warning("Search verification returned no results. Index may be empty or not working.")
+    except Exception as e:
+        logger.error(f"Error verifying search index: {e}")