mattkerlogue.bsky.social
Data/numbers/pretty charts, typically in #rstats. Ex-civil servant still doing government-y stuff, currently @blavatnikschool.bsky.social.
He/him, alphabet mafia type 🏳️🌈, SE London.
https://github.com/mattkerlogue
https://matt.kerlogue.co.uk
48 posts
60 followers
99 following
comment in response to
post
@mattdray.bsky.social is here
comment in response to
post
When I was dealing with datasets of 200,000+ observations, there’s a threshold at which you accept some stuff can’t be resolved, and that’s somewhat driven by whether the level of “dirt” actually has an impact on the final output.
comment in response to
post
On the impact point: I’ve recently been working on a project with country-level data, so 100–200 observations. I’ve invested quite a lot of time in cleaning the data because an error here is highly influential in the output.
comment in response to
post
Smaller datasets get proportionally more time but it’s also easier to identify each individual cleaning action that’s needed so I spend more time on customised cleans on top of the initial generalised routines.
comment in response to
post
I think yes. There’s a simple volume piece to it, the impact of a small number of “dirty” data points is higher. But also I think it’s partly about time. I don’t think my time cleaning data scales linearly with volume.
comment in response to
post
Yes, I didn't think {targets} was likely to be the right thing as it's very tightly concerned with pipelines but thought I'd raise it since the inspector/visualiser functions might help, but also might not.
comment in response to
post
Whereas the first two packages haven't been updated in years, {targets} is still regularly maintained (and is on CRAN).
comment in response to
post
There's also the {targets} package, which is more about analytical pipelines but has functions for inspecting and visualising dependencies docs.ropensci.org/targets/refe...
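For anyone curious, a minimal sketch of the inspection side of {targets} (assuming a `_targets.R` pipeline already exists in the working directory):

```r
library(targets)

# Interactive graph of targets plus the functions/objects they depend on
tar_visnetwork()

# Stripped-back version showing just the targets themselves
tar_glimpse()
```

These are the visualiser functions mentioned above; whether they help outside a pipeline context is another matter.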
comment in response to
post
Yes, I wondered if the second might be more useful. I also mis-read the very end of your post as "and to manage chaos", which depending on the nature of the package might also be true (at least in my own experience) 😂
comment in response to
post
As a social scientist who had worked largely in SPSS and Stata, my biggest challenge when first engaging with R was that you didn't get different types of missing-ness. The {haven} package's tagged_na() is very handy. haven.tidyverse.org/reference/ta...
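A quick sketch of what tagged_na() gives you (toy values, but the functions are from {haven} itself):

```r
library(haven)

# Two different kinds of missing: tag "a" (e.g. refused) and "b" (e.g. not asked)
x <- c(1, tagged_na("a"), 2, tagged_na("b"))

is.na(x)              # tagged values are still ordinary NAs to base R
na_tag(x)             # ...but the tags can be recovered: NA "a" NA "b"
is_tagged_na(x, "a")  # test for one specific kind of missing-ness
```

So downstream code that only knows about NA keeps working, while the SPSS/Stata-style distinctions survive the round trip.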
comment in response to
post
If you can install from GitHub then there are two fairly old packages not on CRAN that I know of: github.com/crsh/depgraph and datastorm-open.github.io/Dependencies...
comment in response to
post
Humans know to use different tools for different things, ergo a generalised LLM chatbot shouldn’t be trying to do maths… it should be asking a thing that can do maths to do the calculation and then returning the result.
comment in response to
post
It’s a perfect example of why I’m not sceptical about AI in general but I am sceptical about the current state of generalised AI/LLMs and what folk are using them for… they shouldn’t be/are unlikely to be an everything tool until they get past their own ego.
comment in response to
post
I have now! This was a delight thanks! 😊
comment in response to
post
[37]: the language written in England: some people living in the USA appropriate this name for their language.
[38]: with Americanisms.
comment in response to
post
"For example, R has catalogues for ‘en_GB’ that translate the Americanisms (e.g., ‘gray’) in the standard messages into English.[37] ... If no suitable translation catalogue is found or a particular message is not translated in any suitable catalogue, ‘English’[38] is used."
comment in response to
post
But today's find is this wonderful footnote deep in the #rstats install documentation cran.r-project.org/doc/manuals/...
comment in response to
post
You can't mention #rstats documentation without mentioning Clippy-gate in the {writexl} package github.com/ropensci/wri...
comment in response to
post
My first experience was years ago when I was working out why medians in Stata and #rstats weren't the same, which is due to different quantile calculations. But in the ?quantile documentation you learn that the 'type 5' quantile "is popular amongst hydrologists".
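For anyone who wants to see the divergence, a base-R sketch (toy data; R defaults to type 7, and other software uses different definitions, hence the Stata mismatch):

```r
x <- 1:10

quantile(x, probs = 0.25, type = 7, names = FALSE)  # R's default
quantile(x, probs = 0.25, type = 2, names = FALSE)  # a common "averaging" definition
quantile(x, probs = 0.25, type = 5, names = FALSE)  # "popular amongst hydrologists"
```

All nine types are documented in ?quantile, footnote-worthy trivia included.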
comment in response to
post
If my function is complex and/or has multiple possible outputs then I like to use return to be explicit to others (especially future me!).
But if the function is a simple helper where it’s just encapsulating a single repeated pipe chain or only a few lines of code then I’ll not bother with a return.
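Roughly what I mean, with made-up helper names:

```r
# Simple helper just wrapping a few lines: implicit return is fine
standardise <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# More complex function with multiple possible outputs: explicit return()
# makes each exit point obvious to others (and future me!)
summarise_scores <- function(x, method = c("mean", "median")) {
  method <- match.arg(method)
  if (all(is.na(x))) {
    return(NA_real_)
  }
  if (method == "median") {
    return(median(x, na.rm = TRUE))
  }
  return(mean(x, na.rm = TRUE))
}
```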
comment in response to
post
Updated because the blog link got borked via a copy paste :/
Lesson here... re-check the links if you copy a draft post and then paste it somewhere else.
comment in response to
post
A little bot that cruises the canals and posts a random location from the CRT's network and, if one exists, a nearby photo published on Flickr (@flickrfdn.bsky.social ).
The bot is written in #rstats and powered by Github Actions. You can read more here: lapsedgeographer.london/2020-10/virt...
comment in response to
post
Oops... thanks for flagging, working link here: lapsedgeographer.london/2020-10/virt...
Current code: github.com/mattkerlogue...
comment in response to
post
This covers when it ran on Twitter; that's since stopped, but it's pretty much the same for Mastodon and Bsky. The code is here: github.com/mattkerlogue...
comment in response to
post
Oops, half the link got curtailed into "..." lapsedgeographer.london/2020-10/virt...
comment in response to
post
Re the missing data point you make at the end: thinking of the facet chart you show, I've sometimes displayed missing data as an outline/unfilled circle, or used a different shape (e.g. an x).
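A rough {ggplot2} sketch of the idea, with an entirely made-up series and a hypothetical placeholder value for the missing point:

```r
library(ggplot2)

df <- data.frame(
  year  = 2015:2020,
  value = c(3, 5, NA, 6, 8, 7)
)
df$missing <- is.na(df$value)
df$value[df$missing] <- 4  # e.g. an imputed value to give the point a position

ggplot(df, aes(year, value)) +
  geom_line() +
  # shape 19 = filled circle, shape 1 = open circle (shape 4 would be an "x")
  geom_point(aes(shape = missing), size = 3) +
  scale_shape_manual(values = c(`FALSE` = 19, `TRUE` = 1))
```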
comment in response to
post
This was a fab post, thanks. Far too few think about these sorts of issues so it's great to have a resource to point to on it.
comment in response to
post
Shout out to @mattdray.bsky.social, whose now-defunct londonmapbot was the inspiration to build the @narrowbotr.bsky.social in the first place.
The bot also exists on Mastodon (mastodon.social/@narrowbotr@...)
comment in response to
post
Currently I'm working on the Blavatnik Index of Public Administration for @blavatnikschool.bsky.social index.bsg.ox.ac.uk
comment in response to
post
I spent far too long at the Cabinet Office, where amongst many other things I was the lead analyst involved in helping set up the Civil Service People Survey (www.gov.uk/government/c...).
comment in response to
post
You can read more about the Blavatnik Index and explore the results here: index.bsg.ox.ac.uk
comment in response to
post
Something like this (from education.economist.com/insights/int...)
comment in response to
post
Can I suggest using points for the male/female score and a bar between them (geom_linerange in ggplot) rather than using separate bars. I think this would make it a lot easier to visualise both the salience patterns of the issues for each sex as well as any differences.
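Something like this sketch, with invented salience numbers just to show the geometry:

```r
library(ggplot2)

# Hypothetical salience scores by sex
df <- data.frame(
  issue = rep(c("Economy", "Health", "Crime"), each = 2),
  sex   = rep(c("Male", "Female"), times = 3),
  score = c(62, 55, 48, 64, 40, 38)
)

# One row per issue, so the linerange can span the two points
wide <- reshape(df, idvar = "issue", timevar = "sex", direction = "wide")

ggplot(wide, aes(y = issue)) +
  geom_linerange(aes(xmin = pmin(score.Male, score.Female),
                     xmax = pmax(score.Male, score.Female))) +
  geom_point(data = df, aes(x = score, colour = sex), size = 3)
```

The bar between the points carries the "difference" reading, while the points themselves still show each sex's level.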
comment in response to
post
Well that makes me feel a lot less bad about not realising it was you either!!
comment in response to
post
@gavinfreeguard.bsky.social has just told me he was chatting to you when I was next to him FML 🤦♂️🤦♂️🤦♂️🤦♂️🤦♂️🤦♂️🤦♂️
comment in response to
post
Very nearly went to the local/regional chat.
comment in response to
post
Aaah. Wonderful. Shame not to have spotted you, I know I ought to have posted about being there but still only just starting to re-engage with the whole social media whatnot.
comment in response to
post
Oh. Were you at the Stats Assembly?
comment in response to
post
I have had a complete essay in my head about this for ~10 years. If only I could actually force myself to write it.
comment in response to
post
Long story short, it’s largely due to emoji that have multiple code points messing up the calculation of the start/end bytes.
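You can see the mismatch in base R; the "family" emoji below is several code points joined with zero-width joiners, and byte-offset APIs count the UTF-8 bytes, not the characters:

```r
s <- "hi \U0001F468\u200D\U0001F469\u200D\U0001F467"  # man + ZWJ + woman + ZWJ + girl

nchar(s, type = "chars")  # code points
nchar(s, type = "bytes")  # UTF-8 bytes -- what start/end offsets are counted in
```

If you compute offsets in characters and the API wants bytes, any emoji before a link shifts it out of place.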
comment in response to
post
@paul.wales I think I’ve worked out the error causing this problem. I’ll apply a temporary fix in the next day or so while I work on a PR for the underlying package.