Tuesday, December 16, 2008

Some basic text mining terminology

Corpus: A collection of texts or documents
Sentiment Analysis: Aims to determine the attitude of a speaker or writer with respect to some topic
Lexicon: A dictionary or encyclopedia
Taxonomy: An arrangement based on a hierarchical structure
Multi-word Term: A group of words represented by a single term
Entities: Names, addresses, Social Security numbers, company names, etc.

Twitter

This is a good article on how companies are starting to make money on Twitter: http://www.internetnews.com/webcontent/article.php/3790161/What%20Keeps%20Twitter%20Chirping%20Along.htm#

Friday, December 12, 2008

Storage de-duplication

Right now de-duplication is being used mainly as a way to decrease the size of backup data sets. I have made the Data Domain 5 TB appliance a key part of my backup strategy in the past. Basically I was able to write around 50 TB of data to the appliance even though it only had 5 TB of actual disk.

How you do this isn't something I want to go into in depth. Basically it breaks data into chunks, compares them to what is already in the system, and if a chunk is already there it stores a pointer instead of the data.
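If you want a feel for the mechanics, here is a minimal sketch of chunk-level de-duplication in Python: fixed-size chunks plus a hash index. This is only to illustrate the idea; real products use variable-size chunking and far smarter indexes, and the chunk size here is made up.

    import hashlib

    CHUNK_SIZE = 4096  # made-up fixed chunk size; real systems use variable-size chunking

    class DedupeStore:
        def __init__(self):
            self.chunks = {}   # hash -> chunk bytes, stored once
            self.files = {}    # name -> list of chunk hashes (the "pointers")

        def write(self, name, data):
            pointers = []
            for i in range(0, len(data), CHUNK_SIZE):
                chunk = data[i:i + CHUNK_SIZE]
                digest = hashlib.sha256(chunk).hexdigest()
                if digest not in self.chunks:   # new data: store the chunk
                    self.chunks[digest] = chunk
                pointers.append(digest)         # duplicate data: keep only a pointer
            self.files[name] = pointers

        def read(self, name):
            return b"".join(self.chunks[d] for d in self.files[name])

    store = DedupeStore()
    store.write("vm1.vmdk", b"A" * 8192 + b"B" * 4096)
    store.write("vm2.vmdk", b"A" * 8192 + b"C" * 4096)  # shares the "A" chunks with vm1
    print(len(store.chunks))  # 3 unique chunks stored instead of 6 written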

Since we were backing up entire VMs, we were getting 20X de-dupe ratios because of the redundancy in the C: drive, for example.

My whole issue is that de-duplication is now thought of as a technology solution. You buy a Data Domain, and anything you write to it is de-duped. I don't think that line of thought makes sense.

I think there are two things that must change. The first is that de-duplication should be thought of as a feature or service. You own a Clariion array, you own a DMX, you own an EVA: you can enable de-duplication. It is a service, not a hardware solution like Data Domain.

With Data Domain you have silos of de-duplicated data. For this reason, the second thing that I think will have to happen soon is global de-duplication. I have thought about this for as long as I can remember understanding storage. It has yet to come in full swing, but it must happen.

The growth rates are so high and there is so much waste that a global de-duplication system must happen: basically something that sits above all of your physical storage, on the same level as (or as part of) storage virtualization, that goes into all of your data and de-duplicates the entire enterprise. Now you don't have to worry about separate de-dupe silos.

The virtual layer can handle the data access and pointers. My bet is that the ratios will be significant. Maybe not as high as with backup, for obvious reasons.

Along with this is the need to be able to access de-duplicated data at fast speeds. This is the hard part and can partly be handled through caching or tiering within the de-duplicated environment.

Thursday, December 11, 2008

LSI

I got a demo from Leximancer today. I like the idea that it doesn't use a Lexicon. I couldn't, however, get an answer on what algorithms it uses. Does it use LSI? What is LSI?

From Webopedia: latent semantic indexing (LSI) is an algorithm used by search engines to determine what a page is about beyond specifically matching the search query text. The LSI algorithm doesn't actually understand the meaning of the words on the page, but it can spot patterns of related words. LSI may return relevant results that don't contain the keyword at all, just pages with related words.
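Under the hood, LSI is basically a truncated singular value decomposition of the term-document matrix, which is why documents about related topics land near each other even when they share few exact words. A minimal sketch with made-up documents (the textbook approach, not necessarily what Leximancer or any particular engine does):

    import numpy as np

    docs = ["cat eats fish", "dog eats meat", "cat and dog play", "stocks fall on bad news"]
    vocab = sorted(set(w for d in docs for w in d.split()))

    # term-document matrix of raw counts
    A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

    # truncated SVD: keep only the top k "concepts"
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # each document expressed in concept space

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cos(doc_vecs[0], doc_vecs[2]))  # "cat eats fish" vs "cat and dog play": high
    print(cos(doc_vecs[0], doc_vecs[3]))  # "cat eats fish" vs the stocks doc: near zero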

I think Google at one point made a change to use this and everybody panicked because they were basing their relevance on pure keywords.

Does Leximancer do this?

Tuesday, December 9, 2008

Virtualization and crack

I led a virtualization strategy project at a large hedge fund. This involved collecting application requirements: mainly performance requirements, whether or not the applications could be virtualized according to the application vendor (pretty much everything can, technically), downtime, RTO, RPO, backup requirements, etc.

Once I had these requirements, I used the VMware tool Capacity Planner to assist me in collecting performance data. It also makes recommendations on whether or not a server can be virtualized, based on parameters put into the system.

A list was created, I spoke with the business, and I developed a migration plan. Luckily, at the time, the P2V tool built into VMware was good enough. I have used PlateSpin in the past, but the latest version that comes with ESX works well.

When executing the migration plan, I used online migration methods (also built into the P2V tool) for increased uptime. Since I was leading a data center migration project at the same time, the plan sometimes called for moving the server to a different data center as part of the same step.

In other words, a server could be migrated to the other data center while it was being virtualized: I would import the VM into the target environment on target storage, since the two data centers were connected from a network perspective. For another client, where that wasn't possible, I used SRDF to migrate the VMs instead, which worked great and allowed for instant migrations.

My whole point in telling you all this is that the worst part was that everyone liked it. Too much. We went from shoving it down people's throats that VMs were the best to people liking it so much that they became addicted. We eventually had to pull back. I set things up so that we could build a fully functional VM in 10 minutes. This made people sick to their stomachs, and they wanted more.

I then built a cost model, which I will talk about later, to charge people for VMs because demand was too high. So what is the difference between a VM and a crack dealer? I guess the comedown is more planned in one instance.

Monday, December 8, 2008

But what about the economy?

As I move between gigs it is always funny to hear my recruiter friends try to scare me. The current tactic for bringing you down is the economy. I tell them my rate and they say, "Oooooo... you know that everyone is cutting rates?" Everybody is hurting, they say. Yes, that may be true, but why are you calling me? Because you need my skills?

Fear rules all. This is how some presidents win elections. The issue is that I understand this and come out ahead because I ignore the fear. While others are taking full time positions I am able to stay a consultant with less competition.

Quit trying to scare me. I do understand that you are scared yourself.

Friday, December 5, 2008

Storage Virtualization

From Wiki: Storage virtualization refers to the process of abstracting logical storage from physical storage. The term is today used to describe this abstraction at any layer in the storage software and hardware stack.

I have done evaluations of EMC Invista (block virtualization), and the cost was very prohibitive for the organization I was working with. We had the Cisco 9506 switches necessary for the SSM module that snaps in storage virtualization, but the SSM module itself was 250K or so with the software for both SANs.

At that time the only thing Invista was really good at was storage migrations. Yes, that is a main feature of storage virtualization, but I wanted some ILM. I want this thing to understand the data and auto-move it based on rules. I believe the latest version has some basic capabilities to do this, but I can do storage migrations in the background using other tools, even cheesy ones like Open Migrator.

By the way, Open Migrator works very well for Windows hosts. I would just do it yourself; don't pay EMC to do the services, as the product is a cakewalk. Basically you create a pair with the target LUN, it syncs without any effect on the host, and when it is done you reboot and it switches over. It does require a couple of reboots, though, depending on whether the driver is needed.

I would welcome conversation on any real examples of Invista. I know that HDS and IBM have been doing their version of storage virtualization and have real customers. EMC, not so much.

Clarabridge

Here is a good article on Clarabridge:

http://www.intelligententerprise.com/blog/archives/2008/11/clarabridge_foc.html

Storage Transformation

I have completed many storage transformations for many large organizations. I think most companies need one. To me a storage transformation means taking a look at multiple areas: tiering (a storage services catalog), application alignment, storage cost optimization, storage utilization, a backup and recovery assessment, and archive.

Basically, it means coming in with a tool such as ECC to collect data on all of the SAN-attached hosts, meeting with the business/application folks with a questionnaire to talk about each application and its requirements, and doing an infrastructure assessment.

For the storage infrastructure, create a storage services catalog that will meet all of the business requirements and fill in any gaps that exist. Then use the business data to align applications to the appropriate tier.

Before you actually align to a tier, come up with a good utilization target; 80% is a good one. With thin provisioning this step becomes much easier to implement, but assuming that doesn't exist, when you tier you move the data to smaller LUNs to improve utilization.
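A quick back-of-the-envelope on why right-sizing the LUNs during the tier move matters (all numbers made up):

    # back-of-the-envelope, all numbers made up
    used_gb = 400               # actual data on the host
    old_lun_gb = 1000           # the over-provisioned LUN it sits on today
    target_util = 0.80          # the utilization target

    new_lun_gb = used_gb / target_util      # right-sized LUN for the tier move
    reclaimed_gb = old_lun_gb - new_lun_gb  # capacity handed back to the pool

    print(used_gb / old_lun_gb)  # 0.4 -> 40% utilization today
    print(used_gb / new_lun_gb)  # 0.8 -> 80% after the move
    print(reclaimed_gb)          # 500.0 GB given back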

A backup and recovery assessment can be as simple as looking at all of the existing jobs and infrastructure and matching the backups to the requirements previously collected. Maybe you really don't need to keep 1 year of data online, maybe you only need one backup a week, etc.

Normally it is good to talk about archive at this stage. This applies to both structured and unstructured archive: database archiving, or moving files to an archive platform based on age, type, etc. This is harder than it sounds because of the available tools. Some of them slow down the system during high I/O (DX, for example).

This all becomes easier if you have a storage virtualization engine on top. Being able to move the data in the background to different tiers, increase utilization, and auto-archive makes it all that much easier.

Thursday, December 4, 2008

Social Media and Text Mining

As I am doing a text mining project for a large pharma, I am realizing that our current scope of just using their internal Voice of Customer data is not enough. As the world changes and more people Blog or post comments online, we need to capture that, mine it and bring it into our analysis.

For this reason, I am going to push harder for us to include scraping blogs as part of this mining project.

The new Way to Learn in the Enterprise

The current way to learn at most organizations is to read some presentations, maybe watch some videos, answer questions at the end, and you are done.

That is old news.

First of all, this should all be centralized in a system. Based on your profile (area of work, preferences, past and future training, assigned learning, skill set, etc.), a list of training areas should be assigned. Right, so that is being done.

The cherry on top is user-generated (or user-modified) learning. What if you are watching a video and you have something to add to the training, a question, a comment, etc.? There should be private discussions, comments, and the ability to upload information to each training area. This allows the training to be modified on the fly. It also raises the consciousness. Learning should always be collaborative.

If everyone is modifying content, tracking issues and questions, the content will improve and people will learn more. Add rewards and it will be fun too.

Social Networking in the Enterprise

This is another thing that will have to happen soon. If not soon, then when the web 2.0 generation starts running companies. I saw that this company has a new release of some enterprise social networking software:

http://www.saba.com/products/social/

Pretty cool stuff. Again, I think one main goal of this is to raise the collective consciousness. One person by themselves isn't smart (exception: typing). That is why we have teams. By connecting everyone's thoughts, allowing them to communicate, and recording this communication, you are essentially creating a collective consciousness.

Not just that, but simply facilitating different types of communication is important. Everyone uses email, the phone, and in-person meetings at most large companies. Email is going to be used less and less, but it will have its place for things that aren't interactive and are more formal.

User created content, interacting with user created content. Now we just have to mine this content with a text mining tool to get relevance.

Best IT Nerd Corporate Shoe


I love looking at ITers' shoes when they are stuck in the corporate environment. I thought I would share mine. I bought them from a vintage store so they are like 20 years old, but you have definitely seen the updated version on some IT folks.

Document and text Summarization

Text summarization software processes and summarizes the document in the time it would take the user to read the first paragraph. The goal is to reduce the length and detail of a document while retaining its main points and overall meaning.

This can be done with natural language understanding, which uses knowledge about the text to generate new sentences that represent it. It can be done by simply extracting some key sentences from a document, or it can simply produce keyword summaries.

None of these work.

Try MS Word. It has a summarization option. The only way to get it to work is to design your document in a way that makes the summary obvious. For example, everything under "Executive Summary" will be tagged as part of the summarization.
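For what it is worth, the "extract some key sentences" approach is usually nothing more than frequency-based sentence scoring. Here is a minimal sketch (plain word counts, no real language understanding), which is also roughly why these tools fall over on anything that isn't written like an executive summary:

    import re
    from collections import Counter

    def summarize(text, n_sentences=2):
        # naive sentence split and raw word frequencies (no stemming, no stop words)
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        freq = Counter(re.findall(r'[a-z]+', text.lower()))

        # score each sentence by the average frequency of the words it contains
        def score(sentence):
            tokens = re.findall(r'[a-z]+', sentence.lower())
            return sum(freq[t] for t in tokens) / (len(tokens) or 1)

        top = sorted(sentences, key=score, reverse=True)[:n_sentences]
        return " ".join(s for s in sentences if s in top)  # keep original order

    doc = ("Storage costs keep growing. De-duplication reduces storage costs by removing "
           "redundant data. My dog likes peas. Backup data sets are highly redundant, so "
           "de-duplication of backup data saves the most storage.")
    print(summarize(doc))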

Twitter in the enterprise

Even though I rarely use Twitter (none of my friends really do, and I am tired of making my dog) I think that there will be no way for companies not to use it. If you look at communication from top to bottom, it has its place. You can always know what people are doing, you know how they are feeling, you know if they are on vacation, and you can post questions and answers to groups of people.

It simply fills a gap in communication. Maybe it won't be Twitter, maybe it will be Outlook or IM status, but somehow there will be organizations that use this more and more. Watch.

Tiered Storage

I have set up so many tiered storage models for different companies that it has become a one-week activity. Why are tiered storage models important? They tell you that they save money, align the data with the appropriate storage based on business requirements (which change over time), and get the best use out of your resources.

In reality they cost money, at least in my experience. I think they are a money maker. I have gone in and created these models, identified gaps, and guess what... those gaps get filled with technology that gets purchased. I need to be selling this stuff too. The storage organizations created them to make money.

I do think, though, that they are necessary. In the past everyone threw the best disk at everything, costing a lot of money. Most applications can run on lower-tiered storage like Clariion. I have moved entire data centers to Clariion (some tiered storage strategy that was) with no issues.

One problem with tiered storage models is that nobody creates the application that takes in requirements and auto-assigns a tier. This is necessary. Guess what, I do that too. I can drop into a company for a short-term gig and save them money. Or cost them money.
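What I mean by an application that takes in requirements and auto-assigns is nothing fancy. A minimal sketch with a made-up three-tier catalog and made-up thresholds, just to show the shape of it:

    # made-up storage services catalog and thresholds, purely illustrative
    def assign_tier(app):
        if app["iops"] > 10000 or app["rto_minutes"] <= 15:
            return "Tier 1"  # e.g. DMX, replicated
        if app["iops"] > 2000 or app["rto_minutes"] <= 240:
            return "Tier 2"  # e.g. Clariion FC
        return "Tier 3"      # e.g. Clariion SATA / archive

    apps = [
        {"name": "trading-db", "iops": 25000, "rto_minutes": 15},
        {"name": "sharepoint", "iops": 3000,  "rto_minutes": 480},
        {"name": "file-share", "iops": 300,   "rto_minutes": 1440},
    ]

    for app in apps:
        print(app["name"], "->", assign_tier(app))
    # trading-db -> Tier 1, sharepoint -> Tier 2, file-share -> Tier 3

The real version pulls the requirements from the questionnaire and the catalog from the tiering exercise, but the logic is that simple.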

Information Extraction

One issue with current text mining is that when key words or terms are extracted to be used for classification or clustering (later in the process), they must be relevant. If this step fails the whole process fails.

There needs to be a network of text mining implementations that communicate to create a larger, more advanced system. Everybody is doing the same thing.
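One common way to score the relevance of an extracted term is TF-IDF: a term matters if it is frequent in the document but rare across the corpus. A minimal sketch (the textbook approach, not what any particular vendor does):

    import math
    from collections import Counter

    docs = [
        "the drug reduced pain in most patients",
        "side effects of the drug in some patients",
        "the agent resolved the billing complaint in minutes",
    ]

    tokenized = [d.split() for d in docs]
    df = Counter(w for doc in tokenized for w in set(doc))  # document frequency
    N = len(docs)

    def tfidf(doc):
        tf = Counter(doc)
        return {w: (tf[w] / len(doc)) * math.log(N / df[w]) for w in tf}

    # words like "the" and "in" score zero; topic words float to the top
    for word, score in sorted(tfidf(tokenized[0]).items(), key=lambda kv: -kv[1]):
        print(word, round(score, 3))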

Text mining stop words

Stop words in text mining are words ignored in a query because they are so commonly used that they can't contribute to relevancy. I think this is a key indicator of why texting, Twitter, and sometimes blogging are much more efficient: essentially we get rid of a lot of the stop words to create more precise statements. One issue with this is that you need to understand the language very well, as well as the person you are communicating with.

Instead of "John went to the store to buy a can of peas" it would be "John went store buy can peas." There are much better examples, but that is just the first one that came to mind. Hmmm...
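The mechanics are trivial. A minimal sketch with a tiny made-up stop list (real lists run to a few hundred words), using the same sentence:

    # tiny made-up stop list; production lists are a few hundred words long
    STOP_WORDS = {"a", "an", "the", "to", "of", "in", "on", "and", "is", "was"}

    def strip_stop_words(sentence):
        return " ".join(w for w in sentence.split() if w.lower() not in STOP_WORDS)

    print(strip_stop_words("John went to the store to buy a can of peas"))
    # -> John went store buy can peas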

EMC RecoverPoint and F5 Load balancers = 5 minute failover across the world

The new version (3) of RecoverPoint started as a pain in the ass. I did the upgrade myself, and the RecoverPoint appliances we were using didn't have enough memory. It took many crashes (and one server outage) before we figured this out. I figured it out by running top and watching the memory.

After upgrading the memory these things work.

The product itself is a great DR solution. It creates point in time snaps of SQL, Exchange, etc.

My solution replicates our virtual machines' boot volumes. This means that if a server at the hosting site goes down, we can immediately power up the target server and it has all of the exact same data, because the boot volume is replicated as well.

You may ask: yeah, but if it has the same IP address, won't there be an issue? My solution to that is to use F5 load balancers (global). This allows clients to point to the F5, and the F5 understands which side is up, local or remote. Seamless failover in under 5 minutes across the world. I bet you can't do that.

The new RecoverPoint GUI is very nice, but the order in which you do things is much different. Basically backwards.

I will say that at any company where I have to do DR strategy in the future, RecoverPoint will definitely be at the top of my list. Storage agnostic and easy to use.

Backup and Recovery Setup

I recently designed and implemented a backup/recovery solution for a small hedge fund. They were using Veritas Backup Exec and a Quantum P50 DLT library (or something like that). Brick-level backups were being done on the Exchange server and everything was being backed up over the LAN.

As with everybody else, our backup and restore times were not meeting SLAs. I performed an analysis and ended up bringing in EMC NetWorker, Data Domain, EMC Replication Manager, Kroll Ontrack and ESX Ranger.

This solution allowed me to back up large data sets on the SAN using Replication Manager, and it allowed for fast backups and restores, with the ability to store around 50 TB of data on 4 TB of disk using Data Domain. The EMC NetWorker software, although harder to learn, is enterprise class.

Backing up virtual machines is cake. ESX Ranger snaps them off and puts them on Data Domain.

All of the Data Domain data is replicated to our DR site.

No more slow brick-level backups of Exchange. By using Replication Manager to clone to the backup server, we back up the Exchange database files in their entirety. Replication Manager also truncates the logs when the backup completes.

What if we want to restore one mailbox? Using Kroll Ontrack we can look into the EDB files, choose the mailbox we want to restore and restore it to a PST.

SLAs are met, restores are quick, and now I sit back and move on to the next project.

Text mining and mapping the human brain

Text mining is a bottom-up approach. Building a system that understands language will become much easier as we begin to map how the human brain works: neural networks, artificial intelligence, etc.

This may sound obvious as everything will be easier to understand, but at some point I think the current approach to text mining will meet the research being done on artificial intelligence/neural networks/mapping the human brain.

I just hope it happens soon as the current software is very basic in how it understands language and needs a lot of teaching.

Text mining Clustering

The major text mining vendors can't seem to do what we need. From what we see, they are concentrated mainly on Voice of Customer. Right now we are looking at Clarabridge. They have the best interface I have seen yet, but trying to find previously unknown information isn't easy.

Classifying the text is done very well, but I don't want to have to build the metadata model. I would like clustering to do a better job and do it for me.

It has yet to be seen.
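For the record, the kind of clustering I keep asking for is basically this: vectorize the text and let the algorithm group it, with no metadata model built up front. A minimal scikit-learn sketch on made-up comments; the part the vendors haven't cracked is making the resulting clusters mean something, and you still have to pick the number of clusters yourself:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    comments = [
        "the rep was rude on the phone",
        "waited forty minutes to reach a rep",
        "the new dosage made me dizzy",
        "felt dizzy and nauseous after the dose",
        "billing charged me twice this month",
        "there is a duplicate billing charge this month",
    ]

    # TF-IDF vectors, then plain k-means over them
    X = TfidfVectorizer(stop_words="english").fit_transform(comments)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    for label, comment in sorted(zip(km.labels_, comments)):
        print(label, comment)
    # ideally the phone-support, side-effect and billing comments land in separate clusters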

What is text mining?

According to Wikipedia: Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers generally to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).