taidesu.bsky.social
Open Source enthusiast and builder of search systems. Currently building search at GitHub. Former DA for open-source OpenSearch @AWS. Opinions are my own.
234 posts 191 followers 93 following
Regular Contributor
Active Commenter
comment in response to post
Elasticsearch doesn’t like having its shards messed with and at this point we need to stop pretending we can control it.
comment in response to post
What the documentation didn’t mention is that setting it to `all` could immediately trigger re-allocation in a few cases. For example, if a node has reached its disk usage watermark, or if the cluster is unbalanced, it may attempt to redistribute shards.
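For anyone curious, roughly what flipping that setting looks like, as a sketch using the elasticsearch-py client (the exact call shape varies by client version, so treat this as illustrative):

```python
# Sketch: re-enabling shard allocation with elasticsearch-py.
# Assumes a client version where cluster.put_settings accepts a
# `persistent` keyword; older clients take body={"persistent": {...}}.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Setting this back to `all` looks like a no-op, but if a node is over
# its disk usage watermark or the cluster is unbalanced, it can kick
# off shard movement immediately.
es.cluster.put_settings(
    persistent={"cluster.routing.allocation.enable": "all"}
)
```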
comment in response to post
In this case it was about shard allocation settings. They were sharing that setting allocation to `all` shouldn’t cause any change. The documentation was technically correct in the sense that setting `all` doesn’t trigger reallocation. BUT…
comment in response to post
5?
- AWS
- Azure
- GCP
IBM? Oracle? Digital Ocean?
comment in response to post
Which is so funny because the software engineers I know have the totally opposite reaction. I feel like GitHub has kind of a cult following, and working here now I get it. They work super hard to ensure people are included and have a good balance.
comment in response to post
I feel like if anyone is getting close it’s Algolia. They seem to have created a pretty efficient black box for search
comment in response to post
Yeah which stinks because so many companies prefer to buy and integrate solutions rather than do data engineering.
comment in response to post
o19s.github.io/ubi/ This gets us closer but it can only measure and improve search with the attributes that currently exist. There are so many attributes that need feature engineering that just can’t be automated away.
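To make that concrete, here’s a sketch of the kind of measurement that behavior logs like UBI enable. The field names are illustrative, not the actual UBI schema:

```python
# Sketch: a crude behavior-based metric from UBI-style query/event logs.
# Field names here are illustrative, not the exact UBI schema.
from collections import defaultdict

queries = [
    {"query_id": "q1", "user_query": "flaky test timeout"},
    {"query_id": "q2", "user_query": "rate limit 403"},
]
events = [
    {"query_id": "q1", "action_name": "click", "position": 3},
    {"query_id": "q2", "action_name": "click", "position": 1},
]

# Collect clicked result positions per query.
clicks = defaultdict(list)
for e in events:
    if e["action_name"] == "click":
        clicks[e["query_id"]].append(e["position"])

# Mean reciprocal rank of the best click per query: a rough signal of
# whether users find results near the top, using only logged attributes.
mrr = sum(1 / min(ps) for ps in clicks.values()) / len(queries)
print(f"MRR over {len(queries)} queries: {mrr:.2f}")
```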
comment in response to post
Every time they’d get a call the system would pop a new tab with the user’s profile. She consumed 100% of her 32GB with Chrome tabs. It’s not the average user that we need protections for… it’s the extreme users 😆
comment in response to post
I worked in a call center once in a support type role. Lady called me over. Told me her Chrome was running slow… See, she had a lot of tabs open. Asked if she needed them anymore. She didn’t. I went to close Chrome and it goes “are you sure you want to close more than 500 tabs?” 🫠
comment in response to post
The tinfoil on the key is so that the receiver will only read the RFID tag on the old key and not the new one until I get a chance to reprogram it. The bike after the wreck, for reference:
comment in response to post
It kills me that I just have the routine down but at least I know what’s coming ¯\_(ツ)_/¯
comment in response to post
It’ll give us a stable baseline and we can represent some typical and some challenging retrieval examples that we’ve seen in the past.
comment in response to post
Right, but refreshing the data every quarter would also mean re-evaluating judgements, as I anticipate many of them would be dramatically different. After typing all this out over the last few days, I’m feeling like synthetic data may be the best way to go.
comment in response to post
The data part is the hard part but I think I can create mock data so we can build some relevancy benchmarks. They won’t really be complete but each query might have 10-20 documents. Mostly I’d be focusing on recreating some challenges we’ve had with retrieval.
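For a rough idea, a sketch of what one entry in that mock benchmark could look like. The shape and field names are made up, not an existing format:

```python
# Sketch: one entry in a synthetic relevancy benchmark. Each query would
# carry 10-20 judged documents recreating a retrieval challenge we've hit.
# The structure and field names here are hypothetical.
benchmark = [
    {
        "query": "null pointer exception in auth middleware",
        "challenge": "exact error strings should outrank loose prose matches",
        "judgments": [
            {"doc_id": "issue-101", "grade": 3},  # exact stack trace match
            {"doc_id": "issue-205", "grade": 1},  # mentions auth, wrong error
            {"doc_id": "issue-318", "grade": 0},  # keyword overlap only
            # ...filled out to 10-20 docs per query
        ],
    },
]
```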
comment in response to post
So Quepid is a great tool, but it’s really heavy to use. You need a database, and there’s a lot of integration needed to get it working with the way GitHub does search. I’m thinking of making a self-contained CLI for relevancy. Something that only provides the most barebones relevancy API + data.
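Roughly what I have in mind, as a sketch. The judgments file format and run_query() hookup are placeholders, not a real tool:

```python
#!/usr/bin/env python3
# Sketch of a self-contained relevancy CLI: read graded judgments from a
# JSON file, run each query against the search endpoint, report NDCG@10.
# The file format and run_query() are placeholders to wire up.
import argparse
import json
import math


def dcg(grades):
    # Discounted cumulative gain over a ranked list of grades.
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))


def ndcg(retrieved_grades, all_grades, k=10):
    ideal = dcg(sorted(all_grades, reverse=True)[:k])
    return dcg(retrieved_grades[:k]) / ideal if ideal else 0.0


def run_query(query):
    # Placeholder: call the real search API and return ranked doc ids.
    return []


def main():
    parser = argparse.ArgumentParser(description="dirty relevancy measurements")
    parser.add_argument("judgments", help='JSON file: {"query": {"doc_id": grade}}')
    args = parser.parse_args()

    with open(args.judgments) as f:
        cases = json.load(f)

    for query, grades in cases.items():
        retrieved = [grades.get(doc_id, 0) for doc_id in run_query(query)]
        print(f"{ndcg(retrieved, list(grades.values())):.3f}  {query}")


if __name__ == "__main__":
    main()
```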
comment in response to post
I’m almost to the point where I think it would be best to create a really small, fine-tuned dataset for developing relevancy. Then we’ll just need to rely on live online relevancy metrics to monitor their success.
comment in response to post
Having an artificial dataset would be really challenging to maintain. We have a really diverse set of data and users, and I can’t imagine we could reflect that group of people accurately. This is especially important now, as we’re looking to dramatically change how documents are indexed.
comment in response to post
The challenge there is that if I pull from source, much of the data will have changed. For example, maybe we boost open issues? Well, when I go to reindex, the issues that were open will now be closed, and that will tank our relevancy. I could create an artificial dataset, however…
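One possible way around that drift, sketched below: snapshot the mutable fields of each judged document at judging time, so later runs evaluate against the state the grades were assigned to. The field choices here are assumptions:

```python
import time

# Sketch: freeze the mutable fields of each judged document at judgment
# time, so a later evaluation sees the state the grade was assigned to.
# Which fields count as "mutable" is an assumption (issue open/closed
# state being the obvious one from the example above).
def snapshot_judged_docs(judgments, fetch_doc):
    frozen = []
    for doc_id in judgments:
        doc = fetch_doc(doc_id)  # placeholder: pull from the live index
        frozen.append({
            "doc_id": doc_id,
            "captured_at": time.time(),
            "state": doc.get("state"),  # e.g. open vs. closed
            "title": doc.get("title"),
            "body": doc.get("body"),
        })
    return frozen

# Later runs index `frozen` into a scratch index instead of re-pulling
# live documents, so "boost open issues" still sees those issues as open.
```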
comment in response to post
Yeah, I’m beginning to think I need to start pushing little relevancy utilities to introduce the topic, so that we can start focusing on some of the bigger-picture stuff.
comment in response to post
These generated queries are typically used by bots to scrape updated issues or issues created by certain users. It’s a use case we want to support but it’s not something we want to “waste” vector search compute on.
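A rough sketch of the routing heuristic that implies. The thresholds and the qualifier pattern are made-up examples, not actual production rules:

```python
import re

# Sketch: route obvious machine-generated queries away from vector search.
# The thresholds and the qualifier pattern are illustrative guesses,
# not actual production rules.
QUALIFIER = re.compile(r"\b\w+:\S+")  # e.g. is:open, author:some-bot


def use_semantic_search(query: str) -> bool:
    if len(query.split()) > 30:  # massive generated queries
        return False
    if len(QUALIFIER.findall(query)) >= 3:  # scraper-style qualifier stacks
        return False
    return True  # short human queries, mostly under five terms


print(use_semantic_search("flaky CI timeout"))                    # True
print(use_semantic_search("is:issue is:open author:bot repo:x"))  # False
```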
comment in response to post
I don’t think semantic search would solve most of our problems, however. I’m fairly sure a majority of our (human-generated) searches are less than five terms. Another challenge is that we have a huge population of searches that use massive generated queries.
comment in response to post
It takes leadership that’s willing to commit to get these types of initiatives across the line. They aren’t going to be done in a few weeks or months. With the size of our data it’s probably a 2-4 year play, but it would be so worth it.
comment in response to post
For issues, we can boost based on:
- Number of comments
- Whether the user is mentioned on the thread
- Number of viewers of the issue
- Highly referenced issues

Actually, these could probably be used across the board to improve relevance.
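To sketch how those signals could plug into something like an Elasticsearch function_score query (the field names are assumptions about the mapping, not GitHub’s actual schema):

```python
# Sketch: combining the boost signals above in an Elasticsearch
# function_score query. Field names (comment_count, mentioned_users,
# view_count, reference_count) are assumptions about the index mapping,
# not GitHub's actual schema.
def boosted_issue_query(user_query: str, username: str) -> dict:
    return {
        "query": {
            "function_score": {
                "query": {"match": {"body": user_query}},
                "functions": [
                    # More discussion usually means a more useful issue.
                    {"field_value_factor": {"field": "comment_count",
                                            "modifier": "log1p"}},
                    # Issues that mention the searching user get a flat boost.
                    {"filter": {"term": {"mentioned_users": username}},
                     "weight": 2.0},
                    # Same idea for viewer counts.
                    {"field_value_factor": {"field": "view_count",
                                            "modifier": "log1p"}},
                    # Highly referenced issues are likely canonical.
                    {"field_value_factor": {"field": "reference_count",
                                            "modifier": "log1p"}},
                ],
                "score_mode": "sum",
                "boost_mode": "multiply",
            }
        }
    }
```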
comment in response to post
It’s so hard for people to understand just how much engineering there is for search. For example, there are probably dozens of attributes we could be using to boost different searches on GitHub.
comment in response to post
Speaking of search, turns out getting organizational support to build relevancy tooling is more challenging than I anticipated. I need relevancy tools to show just how much room for improvement there is. I may just end up rolling a little Python CLI for doing “dirty” relevancy measurements.
comment in response to post
You just listed the good reason. That way infosec can stop bugging me about annoying vulnerability notices /s
comment in response to post
I honestly wish the OpenSearch community had been able to take better advantage of the excitement during the fork. There was and still is a lot of re-architecting that I wish would happen, but I’m not sure it will ever get the attention it needs.
comment in response to post
Imma take it through the wringer 🥊 Seriously, wtf though. You and the Valkey contributors are straight up killing it 🤘 Do you feel like work like this was blocked pre-fork, or was it hard getting alignment?
comment in response to post
Yeah, there’s been talk about redoing it for so long, but I feel like it will never get funded 😭 Agreed though, it’s always bothered me that Dashboards persists “recently accessed” items even after switching tenants.
comment in response to post
When you think about what Amazon makes money on (search on amazon.com), it makes a lot of sense that they know what search engineers need. This release of OpenSearch now has learning to rank included by default. So many little things add up to an all-around better search engine.