
How does text analysis work?

14 Posts
9 Users
2 Reactions
720 Views
Posts: 1282
Free Member
Topic starter
 

I recently discovered that grep was originally created for linguistic analysis of documents! This got me wondering about text analysis and how it works, but I find it tricky for several reasons, including linguistics not being my forte.

So for example, as far as I can make out there are 83,084 unique usernames on this forum. There are 11 that are just a single character long, the longest is 'erniethefastestmilkmaninthewest', and a length of 8 characters is the most common, with 10,800 names. The distribution of username length looks like this.

https://drive.google.com/file/d/1HaNKVBHbKzrADT4yhH0CYj3XFpjI0zs_/view?usp=sharing

There are 947 Spams, 708 MTBs, 478 Jameses and 598 Andys.

I find it tricky not only because I don't think I am getting to a meaningful analysis but also because analysing text is tricky. For example, there are 710 instances of 'Tom' but many of those are part of larger words like 'Atom'. There must be some clever techniques/tools that can match against a dictionary, going from long to short words, and only match strings that haven't already been matched by previous, longer words?
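Something like this rough sketch is the sort of thing I have in mind (the word list is tiny and made up; a real attempt would load a proper dictionary):

# Rough sketch: match dictionary words longest-first and blank out the matched
# letters so shorter words (like 'tom' inside 'atom') can't be counted again.
# Tiny made-up word list -- a real run would load a full dictionary.
words = ["fastest", "milkman", "ernie", "west", "atom", "the", "tom", "cat", "in"]

def find_words(name, vocab):
    found = []
    masked = name.lower()
    for w in sorted(vocab, key=len, reverse=True):  # longest words first
        pos = masked.find(w)
        while pos != -1:
            found.append(w)
            # Overwrite the matched span so later (shorter) words skip it.
            masked = masked[:pos] + "#" * len(w) + masked[pos + len(w):]
            pos = masked.find(w)
    return found

print(find_words("erniethefastestmilkmaninthewest", words))
# ['fastest', 'milkman', 'ernie', 'west', 'the', 'the', 'in']
print(find_words("atomcat", words))
# ['atom', 'cat'] -- no spurious 'tom'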

I'm keen to learn from other people's wisdom on how text analysis works. Just to finish, this is what a word cloud looks like, although I suspect it might not be counting everything correctly.

https://drive.google.com/file/d/11FPS1wyHLWx9IJ7ZcBYwOrTW2nDoDkuI/view?usp=sharing

 

 


 
Posted : 17/10/2025 12:32 pm
Posts: 5902
Full Member
 

I find it tricky not only because I don't think I am getting to a meaningful analysis but also because analysing text is tricky

Mmmm... what's your objective here? Not just "wanting to learn about text analysis", but what do you hope the data will actually show, and what's your hypothesis?

The meaning comes from how the data points to, or away from, your intended endpoint!

 

 


 
Posted : 17/10/2025 1:12 pm
Posts: 8915
Free Member
 

Here are my non-specialist thoughts on the idea:

Not helped by the lack of space characters as delimiters. You can't definitively distinguish some cases (e.g. 'atomcat' could be 'atom cat' or 'a tom cat') because there's no knowing the intent behind them. There will also be deliberate obfuscation. A guess could be taken, or assumptions made.

Regular expressions might be a part of it.

Perhaps you could find a word dictionary, sort it by shortest to longest words, then go through every single word, building some sort of sprawling data structure such that tom will link to atom, stomp, stomach; do to dos, doss, ado, sado, saddo. That will need you to design and code the data structure etc. I think hashing might be involved!
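Very roughly, something like this, just to illustrate the idea (tiny hand-made word list, and a naive nested loop rather than anything clever with hashing):

from collections import defaultdict

# Tiny made-up word list; a real version would load a full dictionary file.
words = ["do", "dos", "doss", "ado", "sado", "saddo", "tom", "atom", "stomp", "stomach"]

# Map each word to the longer words that contain it as a substring.
contained_in = defaultdict(list)
for short in words:
    for longer in words:
        if len(longer) > len(short) and short in longer:
            contained_in[short].append(longer)

print(contained_in["tom"])  # ['atom', 'stomp', 'stomach']
print(contained_in["do"])   # ['dos', 'doss', 'ado', 'sado', 'saddo']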

Password crackers might be a good thing to look at the workings of.

AI is the obvious tool to use.

But you'll still be faced with defining the problem to it, and defining how to deal with the outliers, the non-definitive cases. A rule to define them.

Look at 'mattamped': is it 'matt amped' or 'mat tamped'? One would expect the first to be more likely.


 
Posted : 17/10/2025 1:20 pm
Posts: 8915
Free Member
 

Regarding AI, you can't trust it.

I asked ChatGPT to find the longest word contained within another word, then to repeat the process a few times looking for shorter words. At the bottom of the results was the following:

5) Counter-demonstration chain (a different branching path)
counterdemonstrations — 21 letters — demonstrations held to oppose other demonstrations.
demonstrations — 14 letters — public shows of opinion or proof of something.
demonstration — 13 letters — an act of showing or proving; also a public protest.
station — 7 letters — a place or position (this appears as a substring inside demonstration).

Why did it list station instead of ration, and miss demon, rat, ions, on?
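For comparison, the deterministic version is only a few lines. This sketch assumes a one-word-per-line word list such as /usr/share/dict/words, and the exact output depends entirely on which list you use:

# Find every dictionary word that occurs as a contiguous substring, longest first.
def words_inside(target, wordlist="/usr/share/dict/words", min_len=2):
    with open(wordlist) as f:
        vocab = {w.strip().lower() for w in f if len(w.strip()) >= min_len}
    hits = {target[i:j]
            for i in range(len(target))
            for j in range(i + min_len, len(target) + 1)
            if target[i:j] in vocab}
    return sorted(hits, key=len, reverse=True)

print(words_inside("demonstrations"))
# e.g. ['demonstration', 'rations', 'demons', 'ration', 'demon', 'ions', 'rat', 'on', ...]

Run against 'demonstrations' it finds 'ration', 'demon', 'rat', 'ions' and 'on', but never 'station', because the letters of 'station' aren't contiguous in there (what's actually present is 'stration').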


 
Posted : 17/10/2025 1:35 pm
Posts: 13418
Full Member
 

Text analysis involves analysis, not just stats about a list of words. You need to think of some questions to ask and then analyse the data to see if it gives you the answer - remember that it might not, which is also a valid outcome.

Possible analyses might be "How many people use 'normal' names within their username?" e.g. nickc. "How many people use years in their username?" e.g. anything with a 4-digit number between 1900 and 2025. "How many people use 'normal' names and a date?" e.g. could I hack this person based on their name and DOB?
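The year one is a quick regex job; a rough sketch with invented usernames:

import re

# 4-digit year between 1900 and 2025, not embedded in a longer run of digits.
year_re = re.compile(r"(?<!\d)(19\d{2}|20[01]\d|202[0-5])(?!\d)")

usernames = ["steve1978", "mtbdave", "anna2021", "rider69", "bob20101234"]
hits = [(u, m.group()) for u in usernames if (m := year_re.search(u))]
print(hits)  # [('steve1978', '1978'), ('anna2021', '2021')]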

To split 'tom' from 'atom' you can run the process multiple times to spot words and then remove them from the next pass. Start with 15-letter words, then 14, 13, 12... This means that when the 4-letter pass is run, it will find 'atom', and those letters won't get processed again when it looks for 3-letter words like 'tom'.
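In rough Python that idea might look like this (made-up word list and usernames, purely to show the passes):

from collections import Counter

vocab = ["atom", "tom", "cat"]          # made-up mini dictionary
usernames = ["atomcat", "tomsbike"]     # invented examples

counts = Counter()
# Longest words first; blank out each match so a later, shorter word
# (e.g. 'tom') can't re-count letters already claimed by 'atom'.
for length in range(max(len(w) for w in vocab), 0, -1):
    for word in [w for w in vocab if len(w) == length]:
        remaining = []
        for name in usernames:
            while word in name:
                counts[word] += 1
                name = name.replace(word, "#" * length, 1)
            remaining.append(name)
        usernames = remaining

print(counts)  # Counter({'atom': 1, 'tom': 1, 'cat': 1})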

 

 


 
Posted : 17/10/2025 2:09 pm
Posts: 3314
Full Member
 

Posted by: twisty

For example, there are 710 instances of 'Tom' but many of those are part of larger words like 'Atom'.

Mmm. There are some simple techniques for differentiating between word parts and whole words that are easily dealt with in grep and regex. 
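In grep that's the -w flag (whole words only); in regex it's the \b word-boundary anchor. A quick illustration in Python with a made-up sentence:

import re

text = "Tom and Atom went to the atomic show with tom"

print(len(re.findall(r"tom", text, re.IGNORECASE)))      # 4 -- also counts Atom and atomic
print(len(re.findall(r"\btom\b", text, re.IGNORECASE)))  # 2 -- whole words only

Of course usernames have no spaces inside them, so boundaries only help at the start and end of a name; for a 'tom' buried in the middle you're back to the longest-word-first tricks above.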

You might want to read some examples of using NLP. Plenty of them out there. edit - here's one

and a link to NACTEM https://www.nactem.ac.uk/

 

The Python library NLTK and its R equivalents offer some useful information in their 'ReadMe' docs. Or just google 'NLP'. 

 

Techniques have changed over the years I've dabbled in this field. It's all LLMs now (well, not really, but ...).

 

Oh, and please don't play with word clouds. Imo the text analytics equivalent of the pie chart - superficially pretty but next to useless for interpretation.

Posted by: sirromj

Why did it list station instead of ration, and miss demon, rat, ions, on?

Because it is not doing what you want it to do, or what you think it might be doing. Seems like the 'how many Rs in strawberry' amusement from a few months back. 


 
Posted : 17/10/2025 3:10 pm
Posts: 12123
Full Member
 

Posted by: twisty

I'm keen to learn from other peoples wisdom on how text analysis works.

It depends on what you are trying to do. For example, linguists often want to do parts of speech tagging, so the text is analyzed as words, but "run" might be a noun or a verb, so it's tagged as a different word depending on how it is used. You might want to investigate collocational patterns, i.e. which words tend to co-occur, and compare that across different types of text. Spoken language is very different to formal written language, so the patterns of word frequency and collocational use are very different, and the grammatical patterns are very different too.
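To give a rough flavour with Python's NLTK (you have to download its tokeniser and tagger data first, and the exact package names shift a little between NLTK versions):

import nltk

# One-off model downloads (newer NLTK releases may ask for 'punkt_tab'
# and 'averaged_perceptron_tagger_eng' instead).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

from nltk import word_tokenize, pos_tag

print(pos_tag(word_tokenize("I run every morning")))
# typically: [('I', 'PRP'), ('run', 'VBP'), ('every', 'DT'), ('morning', 'NN')]
print(pos_tag(word_tokenize("That was a good run")))
# typically: [('That', 'DT'), ('was', 'VBD'), ('a', 'DT'), ('good', 'JJ'), ('run', 'NN')]

Same string 'run', different tag depending on context. The collocation side lives in nltk.collocations, which scores which word pairs turn up together more often than chance would suggest.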


 
Posted : 18/10/2025 11:45 am
Posts: 12704
Free Member
 

How many usernames include 69?

 


 
Posted : 19/10/2025 3:41 pm
Posts: 33530
Full Member
 

Mine’s got nothing in it…


 
Posted : 19/10/2025 9:49 pm
toby reacted
Posts: 8915
Free Member
 

Can't say I'd ever thought of the Count in your name in numerical terms!

 


 
Posted : 19/10/2025 11:05 pm
Posts: 1282
Free Member
Topic starter
 

Thanks for the replies, some useful things for me to look into.
I did do this a bit the wrong way round: rather than starting with a meaningful problem statement, I just picked usernames to make it STW-relevant and poked at the data in a less meaningful way.

367 usernames include "69", but again a portion of those are part of larger numbers, system-generated names, etc. But a conclusion we can draw is that the 69:MTB:Spam ratio is approximately 1:2:3, which perhaps only I find interesting 🤣 


 
Posted : 20/10/2025 11:44 am
 Spin
Posts: 7678
Free Member
 

It's still not clear what your purpose is.


 
Posted : 20/10/2025 12:29 pm
Posts: 1282
Free Member
Topic starter
 

Posted by: Spin

It's still not clear what your purpose is.

I'm between jobs right now and always interested in learnding new things.

 


 
Posted : 21/10/2025 1:52 pm
Posts: 12704
Free Member
 

I did have a discussion with someone once about finding every place name that relates to water.

Liverpool, Curlingpond Road, etc., for the whole of the UK.

Off you go and play with QGIS to find out.


 
Posted : 21/10/2025 7:58 pm