taidesu.bsky.social
Open Source enthusiast and builder of search systems. Currently building search at GitHub. Former DA for open-source OpenSearch @AWS. Opinions are my own.
234 posts 191 followers 93 following
Regular Contributor
Active Commenter
comment in response to post
Elasticsearch doesn’t like having its shards messed with and at this point we need to stop pretending we can control it.
comment in response to post
What the documentation didn’t mention is that setting it to `all` could immediately trigger re-allocation in a few cases. For example, if a node has reached its disk usage watermark, or if the cluster is unbalanced, it may attempt to redistribute shards.
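For anyone curious, roughly what flipping that setting looks like, as a sketch using the elasticsearch-py client (the exact call shape varies by client version, so treat this as illustrative):

```python
# Sketch: re-enabling shard allocation with elasticsearch-py.
# Assumes a client version where cluster.put_settings accepts a
# `persistent` keyword; older clients take body={"persistent": {...}}.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Setting this back to `all` looks like a no-op, but if a node is over
# its disk usage watermark or the cluster is unbalanced, it can kick
# off shard movement immediately.
es.cluster.put_settings(
    persistent={"cluster.routing.allocation.enable": "all"}
)
```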
comment in response to post
In this case it was about shard allocation settings. They were sharing that setting allocation to `all` shouldn’t cause any change. The documentation was technically correct in the sense that setting `all` doesn’t trigger reallocation. BUT…
comment in response to post
5?
- AWS
- Azure
- GCP
IBM? Oracle? Digital Ocean?
comment in response to post
Which is so funny because the software engineers I know have the totally opposite reaction. I feel like GitHub has kind of a cult following, and working here now I get it. They work super hard to ensure people are included and have a good balance.
comment in response to post
I feel like if anyone is getting close it’s Algolia. They seem to have created a pretty efficient black box for search
comment in response to post
Yeah which stinks because so many companies prefer to buy and integrate solutions rather than do data engineering.
comment in response to post
o19s.github.io/ubi/ This gets us closer but it can only measure and improve search with the attributes that currently exist. There are so many attributes that need feature engineering that just can’t be automated away.
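To make that concrete, here’s a sketch of the kind of measurement that behavior logs like UBI enable. The field names are illustrative, not the actual UBI schema:

```python
# Sketch: a crude behavior-based metric from UBI-style query/event logs.
# Field names here are illustrative, not the exact UBI schema.
from collections import defaultdict

queries = [
    {"query_id": "q1", "user_query": "flaky test timeout"},
    {"query_id": "q2", "user_query": "rate limit 403"},
]
events = [
    {"query_id": "q1", "action_name": "click", "position": 3},
    {"query_id": "q2", "action_name": "click", "position": 1},
]

# Collect clicked result positions per query.
clicks = defaultdict(list)
for e in events:
    if e["action_name"] == "click":
        clicks[e["query_id"]].append(e["position"])

# Mean reciprocal rank of the best click per query: a rough signal of
# whether users find results near the top, using only logged attributes.
mrr = sum(1 / min(ps) for ps in clicks.values()) / len(queries)
print(f"MRR over {len(queries)} queries: {mrr:.2f}")
```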
comment in response to post
Every time they’d get a call the system would pop a new tab with the user’s profile. She consumed 100% of her 32GB with Chrome tabs. It’s not the average user that we need protections for… it’s the extreme users 😆
comment in response to post
I worked in a call center once in a support type role. Lady called me over. Told me her Chrome was running slow… See, she had a lot of tabs open. Asked if she needed them anymore. She didn’t. I went to close Chrome and it goes “are you sure you want to close more than 500 tabs?” 🫠
comment in response to post
The tinfoil on the key is so that the receiver will only read the RFID tag on the old key and not the new one until I get a chance to reprogram it. The bike after the wreck, for reference:
comment in response to post
It kills me that I just have the routine down but at least I know what’s coming ¯\_(ツ)_/¯
comment in response to post
It’ll give us a stable baseline and we can represent some typical and some challenging retrieval examples that we’ve seen in the past.
comment in response to post
Right, but refreshing the data every quarter would also mean re-evaluating judgements, as I anticipate many of them would be dramatically different. After typing all this out over the last few days, I’m feeling like synthetic data may be the best way to go.
comment in response to post
The data part is the hard part but I think I can create mock data so we can build some relevancy benchmarks. They won’t really be complete but each query might have 10-20 documents. Mostly I’d be focusing on recreating some challenges we’ve had with retrieval.
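For a rough idea, a sketch of what one entry in that mock benchmark could look like. The shape and field names are made up, not an existing format:

```python
# Sketch: one entry in a synthetic relevancy benchmark. Each query would
# carry 10-20 judged documents recreating a retrieval challenge we've hit.
# The structure and field names here are hypothetical.
benchmark = [
    {
        "query": "null pointer exception in auth middleware",
        "challenge": "exact error strings should outrank loose prose matches",
        "judgments": [
            {"doc_id": "issue-101", "grade": 3},  # exact stack trace match
            {"doc_id": "issue-205", "grade": 1},  # mentions auth, wrong error
            {"doc_id": "issue-318", "grade": 0},  # keyword overlap only
            # ...filled out to 10-20 docs per query
        ],
    },
]
```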
comment in response to post
So Quepid is a great tool, but it’s really heavy to use. You need a database, and there’s a lot of integration needed to get it working with the way GitHub does search. I’m thinking of making a self-contained CLI for relevancy. Something that only provides the most barebones relevancy API + data.
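Roughly what I have in mind, as a sketch. The judgments file format and run_query() hookup are placeholders, not a real tool:

```python
#!/usr/bin/env python3
# Sketch of a self-contained relevancy CLI: read graded judgments from a
# JSON file, run each query against the search endpoint, report NDCG@10.
# The file format and run_query() are placeholders to wire up.
import argparse
import json
import math


def dcg(grades):
    # Discounted cumulative gain over a ranked list of grades.
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))


def ndcg(retrieved_grades, all_grades, k=10):
    ideal = dcg(sorted(all_grades, reverse=True)[:k])
    return dcg(retrieved_grades[:k]) / ideal if ideal else 0.0


def run_query(query):
    # Placeholder: call the real search API and return ranked doc ids.
    return []


def main():
    parser = argparse.ArgumentParser(description="dirty relevancy measurements")
    parser.add_argument("judgments", help='JSON file: {"query": {"doc_id": grade}}')
    args = parser.parse_args()

    with open(args.judgments) as f:
        cases = json.load(f)

    for query, grades in cases.items():
        retrieved = [grades.get(doc_id, 0) for doc_id in run_query(query)]
        print(f"{ndcg(retrieved, list(grades.values())):.3f}  {query}")


if __name__ == "__main__":
    main()
```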
comment in response to post
I’m almost to the point where I think it would be best to create a really small, fine-tuned dataset for developing relevancy. Then we’ll just need to rely on live online relevancy metrics to monitor their success.
comment in response to post
Having an artificial dataset would be really challenging to maintain. We have a really diverse set of data and users, and I can’t imagine we could reflect that group of people accurately. This is especially important now, as we’re looking to dramatically change how documents are indexed.
comment in response to post
The challenge there is that if I pull from source, much of the data will have changed. For example, maybe we boost open issues? Well, when I go to reindex, the issues that were open will now be closed, and that will tank our relevancy. I could create an artificial dataset, however…
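One possible way around that drift, sketched below: snapshot the mutable fields of each judged document at judging time, so later runs evaluate against the state the grades were assigned to. The field choices here are assumptions:

```python
import time

# Sketch: freeze the mutable fields of each judged document at judgment
# time, so a later evaluation sees the state the grade was assigned to.
# Which fields count as "mutable" is an assumption (issue open/closed
# state being the obvious one from the example above).
def snapshot_judged_docs(judgments, fetch_doc):
    frozen = []
    for doc_id in judgments:
        doc = fetch_doc(doc_id)  # placeholder: pull from the live index
        frozen.append({
            "doc_id": doc_id,
            "captured_at": time.time(),
            "state": doc.get("state"),  # e.g. open vs. closed
            "title": doc.get("title"),
            "body": doc.get("body"),
        })
    return frozen

# Later runs index `frozen` into a scratch index instead of re-pulling
# live documents, so "boost open issues" still sees those issues as open.
```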
comment in response to post
Yeah, I’m beginning to think I need to start pushing little relevancy utilities to introduce the topic, so that we can start focusing on some of the bigger-picture stuff.
comment in response to post
These generated queries are typically used by bots to scrape updated issues or issues created by certain users. It’s a use case we want to support but it’s not something we want to “waste” vector search compute on.
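A rough sketch of the routing heuristic that implies. The thresholds and the qualifier pattern are made-up examples, not actual production rules:

```python
import re

# Sketch: route obvious machine-generated queries away from vector search.
# The thresholds and the qualifier pattern are illustrative guesses,
# not actual production rules.
QUALIFIER = re.compile(r"\b\w+:\S+")  # e.g. is:open, author:some-bot


def use_semantic_search(query: str) -> bool:
    if len(query.split()) > 30:  # massive generated queries
        return False
    if len(QUALIFIER.findall(query)) >= 3:  # scraper-style qualifier stacks
        return False
    return True  # short human queries, mostly under five terms


print(use_semantic_search("flaky CI timeout"))                    # True
print(use_semantic_search("is:issue is:open author:bot repo:x"))  # False
```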
comment in response to post
I don’t think semantic search would solve most of our problems, however. I’m fairly sure a majority of our (human-generated) searches are less than five terms. Another challenge is that we have a huge population of searches that use massive generated queries.
comment in response to post
It takes leadership that’s willing to commit to get these types of initiatives across the line. They aren’t going to be done in a few weeks or months. With the size of our data it’s probably a 2-4 year play, but it would be so worth it.
comment in response to post
For issues, we can boost based on:
- Number of comments
- Whether the user is mentioned on the thread
- Number of viewers of the issue
- Highly referenced issues

Actually, these could probably be used across the board to improve relevance.
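To sketch how those signals could plug into something like an Elasticsearch function_score query (the field names are assumptions about the mapping, not GitHub’s actual schema):

```python
# Sketch: combining the boost signals above in an Elasticsearch
# function_score query. Field names (comment_count, mentioned_users,
# view_count, reference_count) are assumptions about the index mapping,
# not GitHub's actual schema.
def boosted_issue_query(user_query: str, username: str) -> dict:
    return {
        "query": {
            "function_score": {
                "query": {"match": {"body": user_query}},
                "functions": [
                    # More discussion usually means a more useful issue.
                    {"field_value_factor": {"field": "comment_count",
                                            "modifier": "log1p"}},
                    # Issues that mention the searching user get a flat boost.
                    {"filter": {"term": {"mentioned_users": username}},
                     "weight": 2.0},
                    # Same idea for viewer counts.
                    {"field_value_factor": {"field": "view_count",
                                            "modifier": "log1p"}},
                    # Highly referenced issues are likely canonical.
                    {"field_value_factor": {"field": "reference_count",
                                            "modifier": "log1p"}},
                ],
                "score_mode": "sum",
                "boost_mode": "multiply",
            }
        }
    }
```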
comment in response to post
It’s so hard for people to understand just how much engineering there is for search. For example, there are probably dozens of attributes we could be using to boost different searches on GitHub.
comment in response to post
Speaking of search, turns out getting organizational support to build relevancy tooling is more challenging than I anticipated. I need relevancy tools to show just how much room for improvement there is. I may just end up rolling a little Python CLI for doing “dirty” relevancy measurements.
comment in response to post
You just listed the good reason. That way infosec can stop bugging me about annoying vulnerability notices /s
comment in response to post
I honestly wish the OpenSearch community had been able to take better advantage of the excitement during the fork. There was and still is a lot of re-architecting that I wish would happen, but I’m not sure it will ever get the attention it needs.
comment in response to post
Imma take it through the wringer 🥊 Seriously, wtf though. You and the Valkey contributors are straight up killing it 🤘 Do you feel like work like this was blocked pre-fork, or was it hard getting alignment?
comment in response to post
Yeah, there’s been talk about redoing it for so long, but I feel like it will never get funded 😭 Agreed though, it’s always bothered me that Dashboards persists “recently accessed” items even after switching tenants.
comment in response to post
When you think about what Amazon makes money on (search on amazon.com), it makes a lot of sense that they know what search engineers need. This release of OpenSearch now has learning to rank included by default. So many little things add up to an all-around better search engine.