Topic: To the greatest of developers...

Offline Xenolightning

  • Moderator
  • Posts: 3,485
I issue a challenge.

For a little personal digital media library database, I want to clean up the names of my media files. I have access to an online movie database, from which I retrieve the corrected title, synopsis, year, rating, etc. But because of the wacky naming of some files, some lookups obviously return no results.

I will release the application to my peers when it is finished

Some are formatted like this:

Van Wilder Freshman Year.2009.DvdRip.UR.Xvid {1337x}-Noir
Man.on.a.Ledge.2012.PROPER.DVDRip.XviD-SPARKS
The Grey (2012) DVDRip XviD-MAXSPEED

Can anyone think of an almost foolproof way of extracting the name/title of the file, and the year?

Can be in any language.


Year, (not 100% fool proof, but not bad):
Code: [Select]
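//Get and remove the year in the filename
//(needs: using System.Text.RegularExpressions;)
//Only accept a standalone 4-digit number starting with 19 or 20
Match m = Regex.Match(fileName, @"(?<!\d)(19|20)\d{2}(?!\d)");
if (m.Success)
{
   year = int.Parse(m.Value);
   //Remove only the matched occurrence
   fileName = fileName.Remove(m.Index, m.Length);
}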

Title/Keywords (the bit that needs the most attention):
Code: [Select]
//Strips out anything in [] (non-greedy, so text between two tags survives)
fileName = Regex.Replace(fileName, @"\[.*?\]", "");

//lower it because we don't care yo!
fileName = fileName.ToLower();

//replace commonly used delimiters
fileName = fileName.Replace(".", " ");

//Kill off the rip tag
fileName = fileName.Replace("dvdrip", "");

//these are mostly useless
if (fileName.IndexOf("(") > 0)
    fileName = fileName.Substring(0, fileName.IndexOf("("));

//Anything left after a - can die in a hole
if (fileName.IndexOf("-") > 0)
    fileName = fileName.Substring(0, fileName.IndexOf("-"));

//Trim leftover whitespace, then query my server for an answer
fileName = fileName.Trim();
client.GetClosestMatchByTitleKeywords(fileName, year);

I've thought about getting around this in the most elegant of ways, e.g. getting Google to do my correcting for me, but their search API isn't free :<


EDIT:
Also, I should note that I am caching search queries and results myself, along with manual matches, so it's almost a learning system. The bigger my database gets, the more accurate it will become, but I want to start it off as accurate as I can without having to code some AI algorithm, which I CBF with.
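
A rough sketch of what that caching could look like, in JavaScript rather than the C# above. Everything here is an assumption for illustration: `normaliseKey`, `createMatchCache` and the `lookup` callback are made-up names, and manual matches are assumed to override whatever the remote lookup would return.

```javascript
// Hypothetical sketch: cache lookups so repeated or manually corrected
// filenames never hit the remote movie API again.

// Collapse case and delimiters so trivially different names share a key.
function normaliseKey(fileName) {
    return fileName.toLowerCase().replace(/[\s._-]+/g, " ").trim();
}

function createMatchCache(lookup) {
    const cache = new Map();
    return {
        // Record a match the user picked by hand; it overrides later lookups.
        addManualMatch(fileName, movie) {
            cache.set(normaliseKey(fileName), movie);
        },
        // Return the cached match, or ask `lookup` once and remember the answer.
        find(fileName) {
            const key = normaliseKey(fileName);
            if (!cache.has(key)) {
                cache.set(key, lookup(key));
            }
            return cache.get(key);
        }
    };
}
```

With this shape, "The.Grey" and "the grey" share one cache entry, so the second query costs nothing against the API's limit.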
Last Edit: September 03, 2012, 08:57:18 pm by Xenolightning

Posted: September 03, 2012, 08:51:30 pm
-= Sad pug is sad =-

Offline Pyromanik

  • Hero Member
  • Posts: 28,834
AST.

Reply #1 Posted: September 03, 2012, 09:41:55 pm
Everyone needs more Bruce Campbell.

Offline Pyromanik

  • Hero Member
  • Posts: 28,834
PS, I really really really wish people would stop making DBs out of hashes and use something similar to the 'thumbprints' that FDMF makes instead. Oh, the amount of time that sexy SEXY little program has saved me, although it takes a fair bit of time to run itself... SO MUCH WIN.

http://freecode.com/projects/fdmf

Oh, last release 2010, I had no idea it was still active (used it in 2006, seemed pretty dead back then).





***

That said, it seems like you're using commonly spread 'scene' rips anyway, so a hash-based DB probably wouldn't be all that bad.
Last Edit: September 03, 2012, 09:48:30 pm by Pyromanik

Reply #2 Posted: September 03, 2012, 09:44:43 pm

Offline Pyromanik

  • Hero Member
  • Posts: 28,834
Quote from: Xenolightning;1501291
Van Wilder Freshman Year.2009.DvdRip.UR.Xvid {1337x}-Noir
Man.on.a.Ledge.2012.PROPER.DVDRip.XviD-SPARKS
The Grey (2012) DVDRip XviD-MAXSPEED

/[ .]|\(?(\d{4})\)?.*$/

One could expand on this (it assumes everything after the date is shit, which it usually is) to get a slightly safer regex, which makes sure the date is followed by 'dvdrip', 'bdrip' or 'proper' before smashing the rest to shit:
/[ .]|\(?(\d{4})\)?[ .]\[?(dvd|bd|prop).*$/i

Replace every match of either of the above with " $1" (so add the global flag) and you'll end up with a nice title. Do an extra pass over it if you want to format the date nicer than just having a (double) space before it.

I did a JSFiddle to display results, but turns out JSFiddle is full of shit, only half works when it comes to regex. Odd.

So instead open your web browser console (ctrl+shift+k in firefox or F12 in anything) and run:
Code: [Select]
var lederp = [
    "Van Wilder Freshman Year.2009.DvdRip.UR.Xvid {1337x}-Noir",
    "Man.on.a.Ledge.2012.PROPER.DVDRip.XviD-SPARKS",
    "The Grey (2012) DVDRip XviD-MAXSPEED",
    "Men In Black III 2012 DVDRip XViD-REFiLL",
    "The Avengers (2012) [BDRip720p Ita-Eng][A.C.U.M.]",
    "Frog Dreaming - The Quest - Go Kids",
    "Parpaillon (1993).Fra.L.Moullet"
];
for(var i = 0; i < lederp.length; i++)
    console.log(lederp[i].replace(/[ .]|\(?(\d{4})\)?[ .]\[?(dvd|bd|prop).*$/ig, " $1"));

The extra items there are real Pirate Bay listings.

***

Though it should be said if you're parsing anything irregular with regular expressions... you're doing it wrong.
Last Edit: September 03, 2012, 11:52:03 pm by Pyromanik

Reply #3 Posted: September 03, 2012, 09:55:12 pm

Offline Xenolightning

  • Moderator
  • Posts: 3,485
Yeah thumbprints are probably out of the question, and they won't help to get the title of the movie anyway. Only useful once a database is built.

I've thought about AST, but it really doesn't achieve anything more. Am still stuck on the issue of what constitutes the title, and what is garbage.

I've thought about building a dictionary of garbage words/phrases and comparing the filenames against the dictionary, but alas I still have the issue of building the dictionary. It will probably require multiple rechecks of the file name afterwards too, and there is a query limit on the movie API.
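
For illustration only, the dictionary idea might look something like this in JavaScript. The word list and the `stripJunk` name are assumptions; a real dictionary would have to be grown from actual files.

```javascript
// Hypothetical junk dictionary; a real one would be grown from actual files.
const JUNK_WORDS = new Set([
    "dvdrip", "bdrip", "xvid", "proper", "ur", "720p", "1080p"
]);

function stripJunk(fileName) {
    return fileName
        .toLowerCase()
        .replace(/[\[\](){}]/g, " ")   // brackets become spaces
        .split(/[\s._-]+/)             // split on common delimiters
        .filter(function (word) {
            // Drop empty fragments and anything listed in the dictionary.
            return word !== "" && !JUNK_WORDS.has(word);
        })
        .join(" ");
}
```

Release group names such as SPARKS survive unless they are added to the list, which is exactly the dictionary building issue.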

In this situation a crude solution is more than enough, if it sorts out 95% of media files the last part is easy enough to do manually.

EDIT:

I should probably state, I don't particularly want an answer, just some ideas :-S
Last Edit: September 03, 2012, 10:20:06 pm by Xenolightning

Reply #4 Posted: September 03, 2012, 10:15:03 pm

Offline Bell

  • Addicted
  • Posts: 4,263
Can't really think of a much better way than what you are currently doing.

I think an option that might get you the most accurate results, but could be quite slow, would be to just strip the delimiters and go through the entire database of movie titles, checking whether each title is a substring of your media file's name.
That way it doesn't really matter what junk there is after the title.
You might run into some issues with sequels though, but maybe that's where you can cross-reference the year you retrieved using your current method.
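
A minimal JavaScript sketch of that idea, under a few assumptions: `KNOWN_MOVIES` is a tiny made-up stand-in for the real title database, `findBySubstring` is a hypothetical name, and ties between overlapping titles are broken by year first and title length second.

```javascript
// Hypothetical local list of known titles; in practice this would be the
// full movie database, which is why the approach could be slow.
const KNOWN_MOVIES = [
    { title: "Van Wilder", year: 2002 },
    { title: "Van Wilder Freshman Year", year: 2009 },
    { title: "The Grey", year: 2012 }
];

// Lower-case and turn every run of non-alphanumerics into one space.
function normalise(text) {
    return text.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

function findBySubstring(fileName, year) {
    // Pad with spaces so only whole words can match.
    const haystack = " " + normalise(fileName) + " ";
    const hits = KNOWN_MOVIES.filter(function (movie) {
        return haystack.includes(" " + normalise(movie.title) + " ");
    });
    // Sequels: prefer a year match, then the longest title.
    hits.sort(function (a, b) {
        const byYear = Number(b.year === year) - Number(a.year === year);
        return byYear !== 0 ? byYear : b.title.length - a.title.length;
    });
    return hits.length > 0 ? hits[0] : null;
}
```

Both Van Wilder entries are substrings of the first example file name, and the year 2009 is what picks the right one.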
Last Edit: September 03, 2012, 11:16:23 pm by Bell

Reply #5 Posted: September 03, 2012, 11:00:23 pm

Offline toofast

  • Addicted
  • Posts: 3,697
I do something similar with TV shows, and just use regex: I look for the letter/number break (\b, I think, which works as long as there is a delimiter between the words and the number), assume everything before the break is the title, everything after it (numeric) is the year/episode number, and the rest is scene junk. But it can break if you have a file with numbers in the middle of the title that look like a year. I usually guess what my title is, then search for it on an online DB, and only rename if it returns a match, which means you are only left with a handful of files that have to be done manually.
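
Roughly, that guess-then-confirm flow could be sketched like this in JavaScript. The regex, the function names and the `lookup` callback are all assumptions for illustration, not the actual program described above.

```javascript
// Split a name at the first standalone number: words before it are the
// guessed title, the number is the year/episode, the rest is scene junk.
function guessTitleAndNumber(fileName) {
    const m = /^(.*?)[\s._(\[-]+(\d+)(?=[\s._)\]-]|$)/.exec(fileName);
    if (m === null) {
        return null;
    }
    return { title: m[1].replace(/[._]+/g, " ").trim(), number: Number(m[2]) };
}

// `lookup` is a hypothetical stand-in for the online database query;
// it returns the corrected name, or null when nothing matches.
function planRename(fileName, lookup) {
    const guess = guessTitleAndNumber(fileName);
    const match = guess === null ? null : lookup(guess.title, guess.number);
    // Only rename on a confirmed match; everything else is left for manual work.
    return match === null
        ? { action: "manual", fileName: fileName }
        : { action: "rename", fileName: fileName, newName: match };
}
```

A title with a standalone number in the middle still gets cut short by this, which is the failure case mentioned above; the lookup step is what stops a bad guess from turning into a bad rename.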

Reply #6 Posted: September 03, 2012, 11:02:25 pm

Offline Pyromanik

  • Hero Member
  • Posts: 28,834
Though it should be said if you're parsing anything irregular with regular expressions... you're doing it wrong.
regex is for tokenising, not parsing.
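
To make that distinction concrete, here is a small JavaScript sketch where the regex only splits the name into tokens and ordinary code decides what they mean. The token pattern and the "last year-like token is the release year" rule are assumptions for illustration, not a tested scheme.

```javascript
// Tokenise with a regex, then let plain code decide what the tokens mean.
function tokenise(fileName) {
    // A token is a run of letters, digits, commas or apostrophes.
    return fileName.match(/[A-Za-z0-9,']+/g) || [];
}

function isYear(token) {
    return /^(19|20)\d{2}$/.test(token);
}

function parseTitleAndYear(fileName) {
    const tokens = tokenise(fileName);
    // Assumption: the last year-like token is the release year. Index 0 is
    // skipped so a name that starts with a year keeps it as part of the title.
    let yearIndex = -1;
    for (let i = tokens.length - 1; i > 0; i--) {
        if (isYear(tokens[i])) {
            yearIndex = i;
            break;
        }
    }
    if (yearIndex === -1) {
        return { title: tokens.join(" "), year: null };
    }
    return {
        title: tokens.slice(0, yearIndex).join(" "),
        year: Number(tokens[yearIndex])
    };
}
```

Because brackets and dots are just token separators here, it makes no difference whether the year is wrapped in (), [] or nothing at all.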

Reply #7 Posted: September 03, 2012, 11:53:29 pm

Offline Xenolightning

  • Moderator
  • Posts: 3,485
The JavaScript you wrote isn't bad, but titles like 10,000.B.C[2009]DvDrip break it.

The regex I am using is not for the parsing, but for finding a possible date in the title. Nothing more.

I'm well aware that using regex for this is more or less impossible. Looks like if I continue I will be using a "bad word" dictionary, and using multiple movie databases for queries.

Hopefully it produces some worthwhile results.

Reply #8 Posted: September 04, 2012, 08:22:51 am

Offline Pyromanik

  • Hero Member
  • Posts: 28,834
Yeh, I'm aware things like 10kBC would break; you could stretch it out to adapt... but fuck that :P
Also the JS is just for example, the regex is the important bit. It's generally language agnostic, as most regex flavours share the Perl-style syntax that PCRE implements.

I have to say, I quite like Bell's idea the most really. The more I think about it, the more it makes sense.

Reply #9 Posted: September 04, 2012, 05:35:04 pm

Offline Xenolightning

  • Moderator
  • Posts: 3,485
I've got a decent match rate of 90% of my media library, using a bad word dictionary and a few more bits of logic, cross-referencing the year if it exists.

Reply #10 Posted: September 18, 2012, 12:46:04 pm

Offline Xsannz

  • Addicted
  • Posts: 5,412
Hmmmm, I have a program that works 99 percent of the time. I use it for TV shows and movies, with no issues to date on my 10TB of movies. I'll find the link when I get home.

Reply #11 Posted: September 18, 2012, 01:05:02 pm

Offline Xenolightning

  • Moderator
  • Posts: 3,485
Yeah would be good. The one I'm creating will have a few more uses than just finding the title of media.

Reply #12 Posted: September 18, 2012, 01:19:56 pm

Offline Speakman

  • Hero Member
  • Posts: 12,562
For my TV shows, I use this:

http://tvrename.en.softonic.com/download


Check it out and see if you get any ideas

Reply #13 Posted: September 18, 2012, 06:41:12 pm
Quote from: Mellcor
i had kinda hope speakman had died, what a pity

Offline Slingshot

  • Just settled in
  • Posts: 102
Freebase API
I do something very similar to what you already have.

Strip shit, copy that which comes before year. Kill shit after year.
And search.

Works 99.99% of the time. It fails for remakes where the file/folder name is missing year.

+
Do you name your folders correctly? My app looks at the folder name if it can't get anything out of the name of the file.

e.g. I have some strange file names like ikr-rent034.avi, which of course is Rent (2005).
The folder name is more along the lines of Rent (2005).

Bad results on the file name search, but #1 on the folder name search.
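
As a sketch of that fallback (JavaScript, with assumptions flagged): `search` is a hypothetical stand-in for the movie lookup and returns a match or null, `maxLevels` is a made-up parameter for how many parent folders to try, and the path is assumed to end in a file name with an extension.

```javascript
// `search` is a hypothetical stand-in for the movie lookup; it returns a
// match or null. Path handling is simplified to "/" and "\" separators.
function findWithFolderFallback(path, search, maxLevels) {
    const parts = path.split(/[\\\/]+/).filter(function (p) { return p !== ""; });
    if (parts.length === 0) {
        return null;
    }
    // Drop the extension from the file name itself.
    parts[parts.length - 1] = parts[parts.length - 1].replace(/\.[^.]+$/, "");
    const levels = Math.min(parts.length, 1 + (maxLevels || 1));
    // Try the file name first, then walk up through the parent folders.
    for (let i = 0; i < levels; i++) {
        const candidate = parts[parts.length - 1 - i];
        const match = search(candidate);
        if (match !== null) {
            return { match: match, source: i === 0 ? "file" : "folder", name: candidate };
        }
    }
    return null;
}
```

For a path like Movies/Rent (2005)/ikr-rent034.avi, the file name search comes back empty and the parent folder supplies the match.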

Reply #14 Posted: September 18, 2012, 07:09:57 pm

Offline Xenolightning

  • Moderator
  • Posts: 3,485
Freebase looks mint, will give that a whirl.

The algorithm is working fine at the moment; it's the API that has the limitations, imposed by its inputs and its searching capabilities.

I feel if I use two APIs for redundancy I will be up to 99%.
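
A sketch of how that redundancy might be wired up, in JavaScript. Each provider is a hypothetical function, not a real client for Freebase or any other service, and the calls are kept synchronous for brevity where real API calls would be asynchronous.

```javascript
// Each provider is a hypothetical function (title, year) => match or null.
function searchWithFallback(providers, title, year) {
    for (const provider of providers) {
        let match = null;
        try {
            match = provider(title, year);
        } catch (err) {
            // Treat a failing provider (rate limit, outage) like "no result".
            match = null;
        }
        if (match !== null && match !== undefined) {
            return match;
        }
    }
    return null;
}
```

The order of `providers` decides which API gets asked first, so the one with the more generous query limit would presumably go at the front.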

Regarding the folder/file name combination, yes, it takes the parent folder names into account. I plan to use it for TV series as well, so it will support multiple-level look-backs.

Reply #15 Posted: September 18, 2012, 09:17:05 pm

Offline toofast

  • Addicted
  • Posts: 3,697
Do post what you come up with (with source code if possible). I have a program for TV show renaming, which auto-monitors a folder that all my TV shows/movies go into for sorting. So it would be nice to identify a non-TV show (through regex), throw it into your system, come out with a rename, and then auto-move it to the correct folder.

Reply #16 Posted: September 18, 2012, 09:52:29 pm

Offline Slingshot

  • Just settled in
  • Posts: 102
I think freebase has limits but it's 100,000 reads/day. It also has a downloadable local version. ^_^

Reply #17 Posted: September 19, 2012, 12:39:09 pm

Offline toofast

  • Addicted
  • Posts: 3,697
So what was the end result of this? Did you come up with anything good?

Reply #18 Posted: March 02, 2013, 09:56:00 am

Offline Xenolightning

  • Moderator
  • Posts: 3,485
Only as far as to say it's mildly difficult, and if others wanted to use it I'd have to set up servers.

So I put it indefinitely on hold.

Reply #19 Posted: March 02, 2013, 10:08:42 pm

Offline Apostrophe Spacemonkey

  • Fuck this title in particular.

  • Posts: 19,050
Quote from: Xenolightning;1519612
So I put it indefinitely on hold.

That's the best kind of hold.

Reply #20 Posted: March 04, 2013, 11:05:52 am

Offline Xenolightning

  • Moderator
  • Posts: 3,485
Quote from: Spacemonkey;1519694
That's the best kind of hold.
Agree.

The idea of it was very cool, but it requires a lot of my time to get right. And TBH, I really don't care enough.

Reply #21 Posted: March 04, 2013, 11:43:08 am

Offline Pyromanik

  • Hero Member
  • Posts: 28,834
Watch.
Shift, Del.

Like?
Copy, paste.
Solid.

Reply #22 Posted: March 04, 2013, 10:19:12 pm

Offline Xenolightning

  • Moderator
  • Posts: 3,485
Mmmm no, need more lemon pledge.

Reply #23 Posted: March 04, 2013, 11:09:58 pm

Offline Pyromanik

  • Hero Member
  • Posts: 28,834
k.

Reply #24 Posted: March 04, 2013, 11:35:23 pm